BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.14k stars, 1.88k forks

Add support for concurrent invocations to crawl #120

Open adityak74 opened 6 months ago

adityak74 commented 6 months ago

PlaywrightCrawler creates a lock file and fails when a crawl is invoked concurrently. The class has a running property that we can use to check whether an instance is already running. We should spawn a separate process with Playwright to handle each crawl job.
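One possible shape for the "spawn a process per crawl job" idea, as a rough sketch rather than a tested fix for this repo: each invocation gets its own storage directory and runs in its own Node child process, so two crawls never contend for the same lock file. The script name `crawl-worker.js` and the helpers `buildCrawlJob` and `runCrawlJob` are hypothetical. The sketch assumes the worker reads its storage location from the `CRAWLEE_STORAGE_DIR` environment variable.

```typescript
import { spawn } from "node:child_process";
import { randomUUID } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Everything needed to launch one isolated crawl.
interface CrawlJob {
  command: string;
  args: string[];
  env: NodeJS.ProcessEnv;
  storageDir: string;
}

// Describe a crawl job with a unique storage directory.
// "crawl-worker.js" is a hypothetical script that performs a single crawl.
function buildCrawlJob(url: string, baseDir: string = tmpdir()): CrawlJob {
  const storageDir = join(baseDir, `crawl-${randomUUID()}`);
  return {
    command: process.execPath, // the current Node binary
    args: ["crawl-worker.js", url],
    // Assumption: the worker picks up its storage path from this variable.
    env: { ...process.env, CRAWLEE_STORAGE_DIR: storageDir },
    storageDir,
  };
}

// Launch the job in a child process and resolve with its exit code.
function runCrawlJob(job: CrawlJob): Promise<number | null> {
  return new Promise((resolve, reject) => {
    const child = spawn(job.command, job.args, { env: job.env, stdio: "inherit" });
    child.once("error", reject);
    child.once("exit", (code) => resolve(code));
  });
}
```

With this in place, two concurrent requests would each call `runCrawlJob(buildCrawlJob(url))` and write to separate directories; cleaning up `job.storageDir` afterwards is left to the caller.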

@BikeF

isarikaya commented 5 months ago

Hey @adityak74,

I made some attempts to solve this problem, but I was not successful. Any progress on your side?

adityak74 commented 5 months ago

@isarikaya Can you add some details on what you tried? That would help me investigate. I haven't found a solution yet.

isarikaya commented 5 months ago

@adityak74 As far as I remember I tried the following:

- Storage is created after the first request, and a new request is ignored if that storage already exists. So I tried clearing the storage after each request completed. https://crawlee.dev/docs/guides/request-storage#cleaning-up-the-storages https://stackoverflow.com/questions/74709844/how-to-reset-crawlee-url-cache

- Setting maxRequestsPerCrawl. https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#maxRequestsPerCrawl
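To illustrate why a leftover request store can make a second crawl of the same URL do nothing, here is a toy in-memory model. It is not Crawlee's implementation: `ToyRequestQueue` and `crawlOnce` are hypothetical names, and `drop()` only stands in for whatever storage cleanup the linked guide describes. The `maxRequestsPerCrawl` parameter mirrors the option of the same name as a simple cap on processed requests.

```typescript
// Toy model of a deduplicating request queue (not Crawlee's actual code).
class ToyRequestQueue {
  private seen = new Set<string>();
  private pending: string[] = [];

  // Returns true if the URL was newly enqueued, false if it was already known.
  add(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Take the next pending URL, or undefined when the queue is empty.
  next(): string | undefined {
    return this.pending.shift();
  }

  // Forget all state, analogous to clearing persisted storage between crawls.
  drop(): void {
    this.seen.clear();
    this.pending = [];
  }
}

// Process at most maxRequestsPerCrawl URLs starting from startUrl.
function crawlOnce(
  queue: ToyRequestQueue,
  startUrl: string,
  maxRequestsPerCrawl: number,
): string[] {
  const visited: string[] = [];
  queue.add(startUrl);
  let url: string | undefined;
  while (visited.length < maxRequestsPerCrawl && (url = queue.next()) !== undefined) {
    visited.push(url);
  }
  return visited;
}

const queue = new ToyRequestQueue();
const first = crawlOnce(queue, "https://example.com", 10);  // ["https://example.com"]
const second = crawlOnce(queue, "https://example.com", 10); // [] because the URL is already known
queue.drop();
const third = crawlOnce(queue, "https://example.com", 10);  // ["https://example.com"] again
```

In this model the second call returns nothing until the queue is dropped, which matches the behaviour described in the first bullet.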

I think the approach described here will do what we want, so I'll try to integrate it into the existing code: https://crawlee.dev/docs/guides/parallel-scraping