serpent213 opened 1 year ago
Hey @serpent213,
Indeed, Crawly is built around the spider names, and there is no easy way to switch to something else right now.
However, it may be the case that you don't need it. Let me try to explain my points here:
When I think about your case, as I understand it, you want a broad crawl that goes to multiple websites and extracts all the information from them. Probably there is some scheduler outside Crawly that just does something like:
Crawly.Engine.start_spider(BroadCrawler, start_urls: ["https://google.com"])
Crawly.Engine.start_spider(BroadCrawler, start_urls: ["https://openai.com"])
Could it be that what you actually need is an API to add extra requests to an already running spider?
> When I think about your case, as I understand it, you want a broad crawl that goes to multiple websites and extracts all the information from them. Probably there is some scheduler outside Crawly that just does something like:
>
> Crawly.Engine.start_spider(BroadCrawler, start_urls: "google.com")
> Crawly.Engine.start_spider(BroadCrawler, start_urls: "openai.com")
Exactly, that was my first attempt. Thank you for the inspiration, will look into it!
> an API to add extra requests to an already running spider

@oltarasenko How can I do this? I have a spider running and would like to add more URLs/requests for it to scrape.
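One approach that may work is to push requests directly into the running spider's requests storage. This is only a sketch: `Crawly.RequestsStorage.store/2` and `Crawly.Utils.request_from_url/1` are used by the engine internally, so treat them as an assumption rather than a guaranteed stable public API, and the URLs below are placeholders.

```elixir
# Sketch: feed extra URLs into an already running spider by writing to its
# requests storage. RequestsStorage.store/2 is what the engine uses
# internally; it may not be a stable public interface.
extra_urls = ["https://example.com/a", "https://example.com/b"]

extra_urls
|> Enum.map(&Crawly.Utils.request_from_url/1)
|> then(&Crawly.RequestsStorage.store(BroadCrawler, &1))
```

The requests then flow through the normal scheduling pipeline (middlewares, deduplication) just like requests returned from `parse_item/1`.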
My application basically requires only one spider, but I would like to run many instances of it in parallel. I was assuming that would be possible using the crawl_id.
But now I'm not so sure anymore; the dispatching seems to be based mainly on the spider's name.
What would it take to make that work?
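If the engine really does key running spiders by module name, one hypothetical workaround is to define thin wrapper modules that all delegate to shared spider logic, then start each wrapper as its own instance. All module names here are made up for illustration; the callbacks are the standard `Crawly.Spider` ones.

```elixir
# Hypothetical workaround: run N parallel "instances" of the same spider
# logic by giving each run its own module name, since the engine appears
# to register runs per spider module.
defmodule MySpider.Shared do
  def base_url(), do: "https://example.com"
  def init(), do: [start_urls: ["https://example.com"]]
  def parse_item(_response), do: %{items: [], requests: []}
end

for i <- 1..3 do
  name = Module.concat(MySpider, "Instance#{i}")

  defmodule name do
    use Crawly.Spider

    @impl Crawly.Spider
    def base_url(), do: MySpider.Shared.base_url()

    @impl Crawly.Spider
    def init(), do: MySpider.Shared.init()

    @impl Crawly.Spider
    def parse_item(response), do: MySpider.Shared.parse_item(response)
  end

  Crawly.Engine.start_spider(name)
end
```

This trades memory for isolation: each wrapper gets its own engine entry, so the runs don't collide on the spider name.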