diegov / searchbox

Personal crawling and indexing
GNU General Public License v3.0

UberSpider and decoupling spiders from Scrapy #36

Open diegov opened 1 year ago

diegov commented 1 year ago

The only parts of Scrapy we take advantage of are the scheduler and the downloader. Its management of crawlers and spiders doesn't add anything to our use case, and the abstractions it provides add confusion.

E.g. in the Spider class, start_requests can be overridden on its own, as long as every request includes a callback. If some of the requests don't include a callback, then parse must also be overridden (it raises NotImplementedError by default). If start_requests is not overridden, then the start_urls attribute must be set.
It would be good to isolate our spiders from all this logic, since our behaviours are much simpler. Our requests always include a callback, so the real interface of our spiders is simply Iterable[Request], representing start_requests. That's the only spider interface method Scrapy interacts with; everything else happens through request callbacks. A rough sketch of that reduced contract is below.
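
Roughly something like this (SimpleSpider and PocketSpider are placeholder names for illustration, not existing code in this repo):

```python
from typing import Iterable, Protocol

import scrapy


class SimpleSpider(Protocol):
    # Hypothetical reduced interface: the only method Scrapy ever
    # needs to call on our spiders.
    def start_requests(self) -> Iterable[scrapy.Request]: ...


class PocketSpider:
    # Placeholder example: every request carries its own callback,
    # so the default parse() is never hit.
    def start_requests(self) -> Iterable[scrapy.Request]:
        yield scrapy.Request(
            "https://example.org/favourites",  # placeholder URL
            callback=self.parse_favourites,
        )

    def parse_favourites(self, response: scrapy.http.Response) -> None:
        pass  # extract items / follow-up requests here
```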

The fully independent state management that Scrapy imposes on the crawlers also makes it difficult to share components between them without resorting to global variables.

To simplify the implementation of our spiders, the UberSpider would be the only class implementing scrapy.Spider. There would be only one instance of it, which would allow us to manage the lifecycle of components that are shared by all spiders.
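
Continuing the sketch, the single UberSpider could look roughly like this (shared_cache is a made-up shared component, and PocketSpider is the placeholder from the sketch above):

```python
import scrapy


class UberSpider(scrapy.Spider):
    # Sketch: the only class implementing scrapy.Spider, instantiated once.
    name = "uber"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Components shared by all spiders live here instead of in globals.
        self.shared_cache = {}  # hypothetical shared component
        # The simple spiders, e.g. the PocketSpider sketched earlier.
        self.spiders = [PocketSpider()]

    def start_requests(self):
        # Flatten every spider's requests into the one stream Scrapy sees.
        for spider in self.spiders:
            yield from spider.start_requests()
```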

Instead of fully constructed HTTP requests, spiders could return a higher-level representation that can be routed between the different spiders. E.g. if a Pocket favourite points to a GitHub repository, the UberSpider could route it to the GitHub spider. The spider finally assigned would then produce the full request to be handed over to Scrapy.
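
One way that routing could look; CrawlTarget, GitHubSpider and make_request are invented names for illustration, not anything that exists yet:

```python
from dataclasses import dataclass
from urllib.parse import urlparse

import scrapy


@dataclass
class CrawlTarget:
    # Hypothetical higher-level representation: what to crawl,
    # not yet a concrete HTTP request.
    url: str


class GitHubSpider:
    # Placeholder final spider: builds the real request once a
    # target has been routed to it.
    def make_request(self, target: CrawlTarget) -> scrapy.Request:
        return scrapy.Request(target.url, callback=self.parse_repo)

    def parse_repo(self, response):
        pass  # parse the repository page here


def route(target: CrawlTarget, spiders_by_host: dict) -> scrapy.Request:
    # E.g. a Pocket favourite pointing at github.com is handed to the
    # GitHub spider, which produces the request given to Scrapy.
    host = urlparse(target.url).hostname or ""
    spider = spiders_by_host.get(host, spiders_by_host["default"])
    return spider.make_request(target)
```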

diegov commented 4 months ago

Note: commit 2347f7d7f5d67ac3cf30d15e6145efcae3f90143 added the option to re-route requests between spiders, without an UberSpider implementation.