Closed by rahulbot 1 month ago
Fetching lists of URLs from newscatcher is fast; the slow part is the 3rd stage of the pipeline, in `fetch_text`. You can quickly do a smaller test run locally by tweaking behavior in `scripts.queue_newscatcher_stories.py`: set `load_projects`, `MAX_STORIES_PER_PROJECT`, and `PAGE_SIZE` to smaller values. Then run `./run-fetch-newscatcher.sh` to execute the script; you can monitor progress via the console log messages. One other note: if you want to run the worker (e.g. `./run-workers.sh`) without posting stories to the main server, set `processor.classiers.REALLY_POST` to `False`.
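As a sketch of what a scaled-down local run might look like, here are the kinds of values you could set in `scripts/queue_newscatcher_stories.py` (the specific numbers are illustrative assumptions, not canonical defaults):

```python
# In scripts/queue_newscatcher_stories.py — shrink the run for local testing.
# These values are illustrative; pick whatever is small enough to finish quickly.
MAX_STORIES_PER_PROJECT = 50   # cap stories queued per project
PAGE_SIZE = 10                 # fetch fewer stories per newscatcher API page

# In processor.classiers — avoid posting results to the main server while testing.
REALLY_POST = False
```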
Here's an example of the scrapy stats that are very useful for assessing speed; see in particular the `request_count` and `elapsed_time_seconds` values.
07:18:04.212 | INFO | scrapy.statscollectors - Dumping Scrapy stats:
{'downloader/exception_count': 9320,
'downloader/exception_type_count/twisted.internet.error.ConnectError': 7,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 4,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 4129,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 60,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 5120,
'downloader/request_bytes': 25051955,
'downloader/request_count': 68863,
'downloader/request_method_count/GET': 68863,
'downloader/response_bytes': 2356023933,
'downloader/response_count': 59543,
'downloader/response_status_count/200': 49816,
'downloader/response_status_count/301': 7919,
'downloader/response_status_count/302': 334,
'downloader/response_status_count/303': 3,
'downloader/response_status_count/307': 18,
'downloader/response_status_count/308': 70,
'downloader/response_status_count/401': 2,
'downloader/response_status_count/403': 183,
'downloader/response_status_count/404': 200,
'downloader/response_status_count/405': 6,
'downloader/response_status_count/408': 1,
'downloader/response_status_count/429': 133,
'downloader/response_status_count/500': 802,
'downloader/response_status_count/502': 9,
'downloader/response_status_count/503': 31,
'downloader/response_status_count/504': 12,
'downloader/response_status_count/521': 1,
'downloader/response_status_count/522': 2,
'downloader/response_status_count/525': 1,
'elapsed_time_seconds': 65763.819339,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 10, 12, 7, 18, 4, 210749, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 10543515619,
'httpcompression/response_count': 48322,
'httperror/response_ignored_count': 716,
'httperror/response_ignored_status_count/401': 2,
'httperror/response_ignored_status_count/403': 183,
'httperror/response_ignored_status_count/404': 199,
'httperror/response_ignored_status_count/405': 6,
'httperror/response_ignored_status_count/429': 35,
'httperror/response_ignored_status_count/500': 268,
'httperror/response_ignored_status_count/502': 3,
'httperror/response_ignored_status_count/503': 10,
'httperror/response_ignored_status_count/504': 7,
'httperror/response_ignored_status_count/521': 1,
'httperror/response_ignored_status_count/522': 1,
'httperror/response_ignored_status_count/525': 1,
'log_count/ERROR': 3619,
'log_count/INFO': 1822,
'log_count/WARNING': 1486,
'memusage/max': 1209876480,
'memusage/startup': 557035520,
'response_received_count': 50508,
'retry/count': 8485,
'retry/max_reached': 1825,
'retry/reason_count/408 Request Time-out': 1,
'retry/reason_count/429 Unknown Status': 98,
'retry/reason_count/500 Internal Server Error': 534,
'retry/reason_count/502 Bad Gateway': 6,
'retry/reason_count/503 Service Unavailable': 21,
'retry/reason_count/504 Gateway Time-out': 5,
'retry/reason_count/522 Unknown Status': 1,
'retry/reason_count/twisted.internet.error.ConnectError': 5,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 3,
'retry/reason_count/twisted.internet.error.TimeoutError': 3475,
'retry/reason_count/twisted.web._newclient.ResponseFailed': 40,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4296,
'scheduler/dequeued': 68863,
'scheduler/dequeued/memory': 68863,
'scheduler/enqueued': 68863,
'scheduler/enqueued/memory': 68863,
'spider_exceptions/BadContentError': 161,
'start_time': datetime.datetime(2023, 10, 11, 13, 2, 0, 391410, tzinfo=datetime.timezone.utc)}
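To turn a stats dump like this into a throughput number, divide the counts by the elapsed time. A minimal sketch (using the figures from the dump above):

```python
# Estimate fetch throughput from the Scrapy stats dict above.
stats = {
    'downloader/request_count': 68863,
    'downloader/response_status_count/200': 49816,
    'elapsed_time_seconds': 65763.819339,
}

elapsed_min = stats['elapsed_time_seconds'] / 60
requests_per_min = stats['downloader/request_count'] / elapsed_min
successes_per_min = stats['downloader/response_status_count/200'] / elapsed_min
print(f"{requests_per_min:.1f} requests/min, {successes_per_min:.1f} successful fetches/min")
# → 62.8 requests/min, 45.4 successful fetches/min
```

Note that retries inflate `request_count`, so the successful-fetch rate (~45/min here) is the more honest measure, and it is consistent with the ~50 URLs/minute ceiling discussed below.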
The scrapy integration with the newscatcher fetcher maxes out around 50 URLs/minute. This is insufficient for our needs, and adjusting the throttling-related settings hasn't noticeably increased it. We need to investigate this further, starting by learning more about the relevant Scrapy settings.
My hypothesis is that the solution is to parallelize the fetching: take all the URLs, chunk them into N equal-sized lists, spin up a scrapy spider/crawler for each, and run those in parallel. Some relevant notes.
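As a rough illustration of the chunking idea, a minimal sketch (the function names and the worker body here are hypothetical; the real worker would run a Scrapy crawler per chunk in its own process):

```python
# Hypothetical sketch: split URLs into N roughly equal chunks and run one
# fetch worker per chunk in parallel via a process pool.
from itertools import islice
from multiprocessing import Pool

def chunk(urls, n):
    """Split urls into n roughly equal-sized lists, preserving order."""
    k, rem = divmod(len(urls), n)
    it = iter(urls)
    return [list(islice(it, k + (1 if i < rem else 0))) for i in range(n)]

def fetch_chunk(urls):
    # Placeholder: the real version would start a Scrapy crawler for this
    # chunk inside the subprocess. Here we just report the chunk size.
    return len(urls)

if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    with Pool(processes=3) as pool:
        counts = pool.map(fetch_chunk, chunk(urls, 3))
    print(counts)  # chunk sizes, e.g. [4, 3, 3]
```

One caveat worth checking before going this route: Scrapy's Twisted reactor can only be started once per process, which is why each crawler needs its own subprocess rather than a thread.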