BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.15k stars 1.88k forks

Crawling more than max number of pages #118

Open dcgleason opened 6 months ago

dcgleason commented 6 months ago

Hi, awesome tool.

I'm trying to crawl a large number of pages (100,000), but the max number of pages I set is being exceeded — the crawl is currently at 130,000 pages or so, because more than 100k pages match the URL pattern (`/**`) I put in the conditions.
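For reference, this is roughly the shape of config being used here — a sketch based on gpt-crawler's documented options (the `url`, `match`, and placeholder values are illustrative, not my actual site):

```ts
// config.ts — illustrative sketch of the crawler configuration
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://example.com",          // starting page (placeholder)
  match: "https://example.com/**",     // broad glob: matches 100k+ pages
  maxPagesToCrawl: 100000,             // the cap that is being overshot
  outputFileName: "output.json",
};
```

My understanding is that with concurrent requests in flight, a crawler can overshoot a page cap somewhat before it finishes winding down, but 30% over seems like a lot.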

Is there a feature to stop the crawling process without losing the pages already crawled?

Also, what alternatives exist for training a model on such a large data set? I assume the output, when finished, will certainly exceed the file limits for OpenAI's Assistants API and the GPT creator.

Thanks.