Insutanto / scrapy-distributed

A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.

Implementation proposal #1

Open whalebot-helmsman opened 4 years ago

whalebot-helmsman commented 4 years ago

Hi @Insutanto

You are doing nice work in this repo. I have the same wish: Scrapy should support different message queues.

Older implementations of this idea, and the one you have here, share a common disadvantage: for every type of queue you need to implement a separate scheduler. Besides the amount of work required, such implementations can't benefit from improvements made to scheduling. I am talking mostly about https://github.com/scrapy/scrapy/pull/3520. The reason for going distributed (at least for me) is a large number of domains in a single crawl. Not using DownloaderAwarePriorityQueue makes crawling slower (about 10 times slower) according to the benchmarks in that PR.
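To make the scheduling point concrete, here is a toy sketch of the idea behind downloader-aware scheduling: when many domains are queued, pop the next request from the domain that currently has the fewest downloads in flight. This is not Scrapy's implementation; the class `SlotAwareQueue`, its methods, and the example URLs are hypothetical and exist only for illustration.

```python
from collections import deque


class SlotAwareQueue:
    """Toy queue that pops from the domain with the fewest active downloads.

    Hypothetical illustration only, not Scrapy's DownloaderAwarePriorityQueue.
    """

    def __init__(self):
        self.queues = {}  # domain -> deque of pending URLs
        self.active = {}  # domain -> number of in-flight downloads

    def push(self, domain, url):
        self.queues.setdefault(domain, deque()).append(url)
        self.active.setdefault(domain, 0)

    def pop(self):
        # Only domains that still have pending URLs are candidates.
        candidates = [d for d, q in self.queues.items() if q]
        if not candidates:
            return None
        # Prefer the least busy domain so one large site cannot starve the rest.
        domain = min(candidates, key=lambda d: self.active[d])
        self.active[domain] += 1
        return self.queues[domain].popleft()

    def finished(self, domain):
        # Call when a download for this domain completes.
        self.active[domain] -= 1

    def __len__(self):
        return sum(len(q) for q in self.queues.values())


q = SlotAwareQueue()
for i in range(3):
    q.push("a.example", f"https://a.example/{i}")
q.push("b.example", "https://b.example/0")

# No download finishes in between, so the second pop switches to b.example.
order = [q.pop() for _ in range(4)]
```

A plain FIFO queue would hand out all three `a.example` URLs first; the slot-aware version interleaves domains, which is where the speed-up on many-domain crawls comes from.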

To overcome this, I developed a separation between scheduler logic and the external message queue; it was merged in https://github.com/scrapy/scrapy/pull/3884.

It would be great for your project and the Scrapy community if you switched from a scheduler-based design to a queue-based one.

More details and discussion can be found in https://github.com/scrapy/scrapy/issues/4326. An example of such an implementation for Redis can be found in https://github.com/whalebot-helmsman/scrapy/blob/redis/scrapy/squeues.py#L101-L173 .
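As a rough sketch of what "queue-based" means here: the scheduler stays generic and only talks to a small queue object, so supporting a new backend means writing one queue class rather than a whole scheduler. The example below assumes a queue interface of `push`, `pop`, `close`, and `__len__`; the names `FakeBroker` and `ExternalFifoQueue` are hypothetical, and `FakeBroker` is an in-process stand-in for a real client (Redis, RabbitMQ, Kafka) so the snippet runs without any external service. It is not the code from the linked branch.

```python
import json
from collections import deque


class FakeBroker:
    """Hypothetical in-process stand-in for an external message queue client."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        self._lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        items = self._lists.get(key)
        return items.pop() if items else None

    def llen(self, key):
        return len(self._lists.get(key, ()))


class ExternalFifoQueue:
    """FIFO queue that stores serialized requests in an external broker.

    The scheduler only needs push/pop/close/__len__ (assumed interface),
    so the same scheduler can sit in front of any backend.
    """

    def __init__(self, broker, key):
        self.broker = broker
        self.key = key

    def push(self, request):
        # Requests are plain dicts here; serialize before sending them out.
        self.broker.lpush(self.key, json.dumps(request))

    def pop(self):
        raw = self.broker.rpop(self.key)
        return json.loads(raw) if raw is not None else None

    def close(self):
        # A real client would release its connection here.
        pass

    def __len__(self):
        return self.broker.llen(self.key)


broker = FakeBroker()
queue = ExternalFifoQueue(broker, key="demo:requests")
queue.push({"url": "https://example.com/1", "priority": 0})
queue.push({"url": "https://example.com/2", "priority": 0})
first = queue.pop()
```

Swapping `FakeBroker` for a real client changes only the three broker calls; the queue class and whatever scheduler sits on top of it stay the same.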

There is also a PR for an external queue protocol: https://github.com/scrapy/scrapy/pull/4783

Insutanto commented 4 years ago

Thank you @whalebot-helmsman

I agree with you. It is great that we can support different message queues without implementing a different scheduler for each one. I am tired of those DRY problems. 😫 I have read the issues and PRs that you mentioned; they are very valuable. I will try to use DownloaderAwarePriorityQueue and a queue-based implementation. That will also make it easier for me to implement more modules in the future. 😸 Finally, thank you for your contributions to the Scrapy project. 😸

Insutanto commented 4 years ago

Thanks for your proposal!