Closed by rahulbot 1 month ago
Fetching lists of URLs from newscatcher is fast; the slow part is the 3rd stage of the pipeline, in `fetch_text`. You can quickly do a smaller test run locally by tweaking behavior in `scripts.queue_newscatcher_stories.py`: set `load_projects`, `MAX_STORIES_PER_PROJECT`, and `PAGE_SIZE` to smaller values. Then run `./run-fetch-newscatcher.sh` to execute the script; you can monitor progress via the console log messages. One other note: if you want to run the worker (e.g. `./run-workers.sh`) without posting stories to the main server, set `processor.classiers.REALLY_POST` to `False`.
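As a sketch of what a scaled-down local run might look like, here are the kinds of values you could set in `scripts/queue_newscatcher_stories.py` (the specific numbers are illustrative assumptions, not canonical defaults):

```python
# In scripts/queue_newscatcher_stories.py — shrink the run for local testing.
# These values are illustrative; pick whatever is small enough to finish quickly.
MAX_STORIES_PER_PROJECT = 50   # cap stories queued per project
PAGE_SIZE = 10                 # fetch fewer stories per newscatcher API page

# In processor.classiers — avoid posting results to the main server while testing.
REALLY_POST = False
```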
Here's an example of the scrapy stats that are very useful for assessing speed; see in particular the `request_count` and `elapsed_time_seconds` values.
07:18:04.212 | INFO | scrapy.statscollectors - Dumping Scrapy stats:
{'downloader/exception_count': 9320,
'downloader/exception_type_count/twisted.internet.error.ConnectError': 7,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 4,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 4129,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 60,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 5120,
'downloader/request_bytes': 25051955,
'downloader/request_count': 68863,
'downloader/request_method_count/GET': 68863,
'downloader/response_bytes': 2356023933,
'downloader/response_count': 59543,
'downloader/response_status_count/200': 49816,
'downloader/response_status_count/301': 7919,
'downloader/response_status_count/302': 334,
'downloader/response_status_count/303': 3,
'downloader/response_status_count/307': 18,
'downloader/response_status_count/308': 70,
'downloader/response_status_count/401': 2,
'downloader/response_status_count/403': 183,
'downloader/response_status_count/404': 200,
'downloader/response_status_count/405': 6,
'downloader/response_status_count/408': 1,
'downloader/response_status_count/429': 133,
'downloader/response_status_count/500': 802,
'downloader/response_status_count/502': 9,
'downloader/response_status_count/503': 31,
'downloader/response_status_count/504': 12,
'downloader/response_status_count/521': 1,
'downloader/response_status_count/522': 2,
'downloader/response_status_count/525': 1,
'elapsed_time_seconds': 65763.819339,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 10, 12, 7, 18, 4, 210749, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 10543515619,
'httpcompression/response_count': 48322,
'httperror/response_ignored_count': 716,
'httperror/response_ignored_status_count/401': 2,
'httperror/response_ignored_status_count/403': 183,
'httperror/response_ignored_status_count/404': 199,
'httperror/response_ignored_status_count/405': 6,
'httperror/response_ignored_status_count/429': 35,
'httperror/response_ignored_status_count/500': 268,
'httperror/response_ignored_status_count/502': 3,
'httperror/response_ignored_status_count/503': 10,
'httperror/response_ignored_status_count/504': 7,
'httperror/response_ignored_status_count/521': 1,
'httperror/response_ignored_status_count/522': 1,
'httperror/response_ignored_status_count/525': 1,
'log_count/ERROR': 3619,
'log_count/INFO': 1822,
'log_count/WARNING': 1486,
'memusage/max': 1209876480,
'memusage/startup': 557035520,
'response_received_count': 50508,
'retry/count': 8485,
'retry/max_reached': 1825,
'retry/reason_count/408 Request Time-out': 1,
'retry/reason_count/429 Unknown Status': 98,
'retry/reason_count/500 Internal Server Error': 534,
'retry/reason_count/502 Bad Gateway': 6,
'retry/reason_count/503 Service Unavailable': 21,
'retry/reason_count/504 Gateway Time-out': 5,
'retry/reason_count/522 Unknown Status': 1,
'retry/reason_count/twisted.internet.error.ConnectError': 5,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 3,
'retry/reason_count/twisted.internet.error.TimeoutError': 3475,
'retry/reason_count/twisted.web._newclient.ResponseFailed': 40,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4296,
'scheduler/dequeued': 68863,
'scheduler/dequeued/memory': 68863,
'scheduler/enqueued': 68863,
'scheduler/enqueued/memory': 68863,
'spider_exceptions/BadContentError': 161,
'start_time': datetime.datetime(2023, 10, 11, 13, 2, 0, 391410, tzinfo=datetime.timezone.utc)}
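To turn a stats dump like this into a throughput number, divide the counts by the elapsed time. A minimal sketch (using the figures from the dump above):

```python
# Estimate fetch throughput from the Scrapy stats dict above.
stats = {
    'downloader/request_count': 68863,
    'downloader/response_status_count/200': 49816,
    'elapsed_time_seconds': 65763.819339,
}

elapsed_min = stats['elapsed_time_seconds'] / 60
requests_per_min = stats['downloader/request_count'] / elapsed_min
successes_per_min = stats['downloader/response_status_count/200'] / elapsed_min
print(f"{requests_per_min:.1f} requests/min, {successes_per_min:.1f} successful fetches/min")
# → 62.8 requests/min, 45.4 successful fetches/min
```

Note that retries inflate `request_count`, so the successful-fetch rate (~45/min here) is the more honest measure, and it is consistent with the ~50 URLs/minute ceiling discussed below.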
The scrapy integration with the newscatcher fetcher maxes out around 50 URLs/minute. This is insufficient for our needs, and adjusting the throttling-related settings hasn't noticeably increased it. We need to investigate this further, starting by learning more about the relevant Scrapy settings.
My hypothesis is that the solution is to parallelize the fetching: take all the URLs, chunk them into N equal-sized lists, spin up a scrapy spider/crawler for each, and run those in parallel. Some relevant notes.
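As a rough illustration of the chunking idea, a minimal sketch (the function names and the worker body here are hypothetical; the real worker would run a Scrapy crawler per chunk in its own process):

```python
# Hypothetical sketch: split URLs into N roughly equal chunks and run one
# fetch worker per chunk in parallel via a process pool.
from itertools import islice
from multiprocessing import Pool

def chunk(urls, n):
    """Split urls into n roughly equal-sized lists, preserving order."""
    k, rem = divmod(len(urls), n)
    it = iter(urls)
    return [list(islice(it, k + (1 if i < rem else 0))) for i in range(n)]

def fetch_chunk(urls):
    # Placeholder: the real version would start a Scrapy crawler for this
    # chunk inside the subprocess. Here we just report the chunk size.
    return len(urls)

if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    with Pool(processes=3) as pool:
        counts = pool.map(fetch_chunk, chunk(urls, 3))
    print(counts)  # chunk sizes, e.g. [4, 3, 3]
```

One caveat worth checking before going this route: Scrapy's Twisted reactor can only be started once per process, which is why each crawler needs its own subprocess rather than a thread.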