BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt

How to search sub-folders e.g. xyz.com/folder/page1, page 2 etc. #84


nynewco commented 7 months ago

How do you search for all the sub-folders? e.g.

https://thoughtcatalog.com/trisha-bartle/
https://thoughtcatalog.com/trisha-bartle/page/2/
https://thoughtcatalog.com/trisha-bartle/page/3/
etc.

SamToukin commented 7 months ago

> How do you search for all the sub-folders? e.g.
>
> https://thoughtcatalog.com/trisha-bartle/
> https://thoughtcatalog.com/trisha-bartle/page/2/
> https://thoughtcatalog.com/trisha-bartle/page/3/
> etc.

To match all of the paginated sub-pages, you can use the following glob pattern:

https://thoughtcatalog.com/trisha-bartle/page/**

This covers every page, regardless of the specific page number, with ** acting as a wildcard for all possible variations.
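
For clarity, ** in a glob spans any number of path segments. Here is a quick sketch of what that pattern does and doesn't match. It uses the minimatch library directly, which is an assumption about the matcher the crawler applies under the hood:

import { minimatch } from "minimatch";

const pattern = "https://thoughtcatalog.com/trisha-bartle/page/**";

// Paginated URLs under /page/ match, whatever the page number:
minimatch("https://thoughtcatalog.com/trisha-bartle/page/2/", pattern); // true
minimatch("https://thoughtcatalog.com/trisha-bartle/page/3/", pattern); // true

// The bare author page itself does NOT match this pattern:
minimatch("https://thoughtcatalog.com/trisha-bartle/", pattern); // false

One caveat: if you also want the first listing page (which lives at /trisha-bartle/ rather than /page/1/), the pattern above won't pick it up.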

nynewco commented 7 months ago

I tried this approach using your exact expression:

export const defaultConfig: Config = {
    url: "https://thoughtcatalog.com",
    match: "https://thoughtcatalog.com/trisha-bartle/page/**",
    maxPagesToCrawl: 10,
    outputFileName: "output.json",
};

and it failed:

INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 10 - URL: https://thoughtcatalog.com/...
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3102,"requestsFinishedPerMinute":18,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3102,"requestsTotal":1,"crawlerRuntimeMillis":3359}
INFO PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}

SamToukin commented 7 months ago

> I tried this approach using your exact expression [...] and it failed:
>
> INFO PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}

Upon reviewing your configuration, it seems it can be adjusted to better suit what you're targeting:

export const defaultConfig: Config = {
    url: "https://thoughtcatalog.com",
    match: "https://thoughtcatalog.com/trisha-bartle/page/**",
    maxPagesToCrawl: 10,
    outputFileName: "output.json",
};

The url parameter is currently set to "https://thoughtcatalog.com", which might be too broad if you are specifically looking for results within the "trisha-bartle/page" subdirectory. The crawler only enqueues links that satisfy the match pattern, so if the start page doesn't link directly to any /trisha-bartle/page/ URLs, the crawl ends after that single request, which is exactly what your log shows.

I would suggest redefining the url parameter to more closely match the structure you are targeting. You might consider setting the starting URL as follows:

url: "https://thoughtcatalog.com/trisha-bartle/page",

The match parameter can stay as it is, since it already aligns with the structure of the links you want to capture:

match: "https://thoughtcatalog.com/trisha-bartle/page/**",

These adjustments should let the crawler target matching links right from the start of the crawling process. Feel free to experiment with these changes.
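
Putting both changes together, a sketch of the full config. Note that whether /page resolves without a page number is an assumption about that particular site, so starting the crawl from a concrete paginated URL such as /page/2/ may be safer:

import { Config } from "./src/config";

export const defaultConfig: Config = {
    // Start inside the section being targeted rather than at the site root,
    // so the first page already links to URLs that satisfy the match pattern
    url: "https://thoughtcatalog.com/trisha-bartle/page/2/",
    // Only follow links under /trisha-bartle/page/
    match: "https://thoughtcatalog.com/trisha-bartle/page/**",
    maxPagesToCrawl: 10,
    outputFileName: "output.json",
};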

nynewco commented 6 months ago

Still getting the hang of it.

Trying to find the names of all the artists on this page:

url: "https://midlibrary.io/",
match: "https://midlibrary.io/feature/dark/"

Getting an error right off the bat.

BTNGaming commented 6 months ago

So I'm doing the same thing, but it's only collecting one page.

url: "https://overkillgaming.com/"
match: "https://overkillgaming.com/**"

Something isn't working right, as it's only getting the first page and calling it quits with a 2 KB JSON file. Anything I could be doing to fix this issue?

https://i.imgur.com/PW41k9a.png

R-Lek commented 5 months ago

I ran into the same issue when attempting to run the crawler on this site: help.anva.nl. I configured it like this:

export const defaultConfig: Config = {
    url: "https://help.anva.nl/topic",
    match: "https://help.anva.nl/topic/**",
    maxPagesToCrawl: 50,
    outputFileName: "output.json",
    maxTokens: 2000000,
};

figuring it would extract all possible pages after /topic/, such as:

help.anva.nl/topic/6910/ADN-dataco...
help.anva.nl/topic/6910/ADN-dataco...
help.anva.nl/topic/49476/Afdruk/(menu)
etc.

However, the output I get is:

INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 50 - URL: https://help.anva.nl/topic/57578/Welkom_in_de_ANVA_Help...
INFO PlaywrightCrawler: Crawling: Page 2 / 50 - URL: https://help.anva.nl/topic/57578/Welkom_in_de_ANVA_Help#...
INFO PlaywrightCrawler: Crawling: Page 3 / 50 - URL: https://help.anva.nl/topic/57578/53212.htm...
INFO PlaywrightCrawler: Crawling: Page 4 / 50 - URL: https://help.anva.nl/topic/57578/indexpage.htm...
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics:

I've tried some variations but I can't seem to extract anything more than this.