BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.15k stars 1.88k forks

How to search all URLs with a certain word in it #85

Open nynewco opened 7 months ago

nynewco commented 7 months ago

How to search all URLs with a certain word in it?

E.g., the word "usa" anywhere within any URL on the ft.com site?

ft.com/usa/xyz
ft.com/today/opinion/usa
ft.com/today/articles/usa

Is this the Selector? If so, how do you do it?

bigshirtjonny commented 7 months ago

ft.com/*usa*

nynewco commented 6 months ago

I tried two versions based on your recommendation, but neither worked. Did I get something wrong?

url: "https://www.ft.com", match: "ft.com/usa",

url: "https://www.ft.com/", match: "ft.com/usa",

Daethyra commented 6 months ago

You need to use glob pattern matching (wildcards such as * and **, not regular expressions) inside the match value. For example:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://api.python.langchain.com/en/stable/langchain_api_reference.html",
  match: "https://api.python.langchain.com/en/stable/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

So in your case, to match every page on www.ft.com, it would be:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.ft.com/",
  match: "https://www.ft.com/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
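
The pattern above matches every page on the site. To narrow the crawl to URLs that contain "usa", a pair of globs might work. The sketch below is a rough illustration only: `globToRegExp` and `matchesAny` are hypothetical helpers written for this example, not part of gpt-crawler, and they imitate just three common glob rules (`*` stays within one path segment, `**/` spans zero or more segments, `?` is one character). The real matching is done by the crawler's own glob library, which may differ in edge cases. The `https://www.ft.com` prefix and the `world/europe` URL are assumed for the demonstration.

```typescript
// Simplified glob-to-RegExp converter, for illustration only (NOT the crawler's matcher).
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === "**/") {
      source += "(?:.*/)?"; // zero or more whole path segments
      i += 3;
    } else if (glob.slice(i, i + 2) === "**") {
      source += ".*"; // anything, including "/"
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // anything within a single path segment
      i += 1;
    } else if (glob.charAt(i) === "?") {
      source += "[^/]"; // exactly one character, not "/"
      i += 1;
    } else {
      // Escape regex metacharacters so the remaining text is matched literally.
      source += glob.charAt(i).replace(/[.+^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

// Assumed patterns: one for "usa" in the last segment, one for "usa" in an earlier segment.
const usaPatterns: string[] = [
  "https://www.ft.com/**/*usa*",
  "https://www.ft.com/**/*usa*/**",
];

// True when at least one glob matches the URL.
function matchesAny(url: string, globs: string[]): boolean {
  return globs.some((glob) => globToRegExp(glob).test(url));
}

const candidates: string[] = [
  "https://www.ft.com/usa/xyz",
  "https://www.ft.com/today/opinion/usa",
  "https://www.ft.com/today/articles/usa",
  "https://www.ft.com/world/europe", // hypothetical URL without "usa"
];

// Keeps the three URLs containing "usa"; the europe URL is filtered out.
const matched = candidates.filter((url) => matchesAny(url, usaPatterns));
console.log(matched);
```

If these patterns behave the same way in the crawler, setting `match: usaPatterns` in the config (an array is accepted there) should restrict the crawl to those pages. Note that `*usa*` also matches segments such as `usage`, since it looks for the letters anywhere in a segment.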

nynewco commented 6 months ago

This worked great!

How do I get it to crawl the other 13 or 14 pages of a paginated listing (see the bottom of the page) on a site such as this?

https://midlibrary.io/categories/illustrators

I noticed that pages 2, 3, and onward look like this:

https://midlibrary.io/categories/illustrators?226b50f6_page=2
https://midlibrary.io/categories/illustrators?226b50f6_page=3
https://midlibrary.io/categories/illustrators?226b50f6_page=4

Daethyra commented 5 months ago

Listing each page URL separately is inefficient when a wildcard would cover them all.

Taking your question literally, the solution would be:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://midlibrary.io/categories/illustrators",
  match: [
    "https://midlibrary.io/categories/illustrators?226b50f6_page=2",
    "https://midlibrary.io/categories/illustrators?226b50f6_page=3",
    "https://midlibrary.io/categories/illustrators?226b50f6_page=4",
  ],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
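
If an explicit list is really what you want (for example, to crawl exactly pages 2 through 14 and nothing else), it can be generated instead of typed by hand. This is a minimal sketch: `pageUrls` is a hypothetical helper, and the page count of 14 is an assumption taken from the "13 or 14 pages" mentioned in the question.

```typescript
// Hypothetical helper: builds "<base>?<param>=<n>" for every page number in [first, last].
function pageUrls(base: string, param: string, first: number, last: number): string[] {
  const urls: string[] = [];
  for (let page = first; page <= last; page++) {
    urls.push(`${base}?${param}=${page}`);
  }
  return urls;
}

// Assumption: the listing has 14 pages; adjust the last argument to the real count.
const illustratorPages = pageUrls(
  "https://midlibrary.io/categories/illustrators",
  "226b50f6_page",
  2,
  14,
);

console.log(illustratorPages.length); // 13
console.log(illustratorPages[0]); // https://midlibrary.io/categories/illustrators?226b50f6_page=2
```

The resulting array could then be passed as the `match` value in place of the hand-written list.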

But because the match URLs all contain 226b50f6_page, you should just use a wildcard:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://midlibrary.io/categories/illustrators",
  match: "https://midlibrary.io/categories/illustrators?226b50f6_page=**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
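
To see roughly what that wildcard pattern accepts, here is a hand-written regular expression that approximates it. This is a sketch under two assumptions about minimatch-style globs: `?` is itself a wildcard for exactly one non-"/" character (so it happens to match the literal "?" in the URL), and a trailing `**` that is not a whole path segment behaves like `*`. The last sample URL is hypothetical.

```typescript
// Approximation of the glob
//   https://midlibrary.io/categories/illustrators?226b50f6_page=**
// "?"  -> [^/]   (any single character except "/")
// "**" -> [^/]*  (assumed to act like "*" because it is not a whole path segment)
const pagePattern =
  /^https:\/\/midlibrary\.io\/categories\/illustrators[^/]226b50f6_page=[^/]*$/;

const samples: string[] = [
  "https://midlibrary.io/categories/illustrators", // first page, no query string
  "https://midlibrary.io/categories/illustrators?226b50f6_page=2",
  "https://midlibrary.io/categories/illustrators?226b50f6_page=14",
  "https://midlibrary.io/categories/illustrators-226b50f6_page=2", // hypothetical URL
];

// Prints false for the first sample and true for the other three.
samples.forEach((url) => {
  console.log(pagePattern.test(url), url);
});
```

The unpaginated first page does not match the pattern, but it is still crawled because it is the start `url`. The hypothetical last sample matches only because `?` is a single-character wildcard, which is harmless here but worth knowing.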

You should read up on glob pattern matching.