Closed: math4humanities closed this 3 months ago
Interesting. Have you tried doing a longer run on your dev machine with this approach? I.e., have you done a full run of newscatcher or wayback machine locally with this branch?
Also: where do you set the maximum number of spiders to use?
Yes, but to achieve a smooth run we should dramatically increase the batch size. Currently the number of spiders used depends on the batch size, but I can easily rewrite it so that, conversely, the batch size depends on a set maximum number of spiders. The current batch size is arbitrary and should probably be more reflective of our expected workload.
I've rewritten the function to take a set number of spiders as input, with the default set to 4. This is a more uniform approach, and performance hasn't been significantly affected. I've completed a long run, and it works as expected.
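For illustration, here is a minimal sketch of what deriving the batch size from a spider cap could look like. `make_batches` and its signature are assumptions for the example, not the actual code in fetcher.py:

```python
import math

def make_batches(urls, max_spiders=4):
    """Split urls into at most max_spiders batches of roughly equal size.

    Hypothetical helper: the batch size is derived from the spider cap,
    instead of the spider count being derived from a fixed batch size.
    """
    if not urls:
        return []
    batch_size = math.ceil(len(urls) / max_spiders)
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```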
This is a large change, so I want to try it out on my dev machine before releasing, and supervise its first run on prod to make sure it performs well. I'll revisit this next week.
Split URLs into batches and fetch with multiple spiders concurrently using a Twisted DeferredList
Added supporting function run_spider and modified fetch_all_html in fetcher.py. Note: I also updated pyproject and the pre-commit config to reflect these changes.
Fetching time should be cut by 40-60%.
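A rough sketch of the approach, assuming a Scrapy `CrawlerRunner` and a spider class passed in by the caller; `run_spider` and `fetch_all_html` here are simplified stand-ins for the real functions in fetcher.py, and the `urls` keyword is an assumed spider argument:

```python
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import defer, reactor


def run_spider(runner, spider_cls, urls):
    # Schedule one crawl for a single batch of URLs; CrawlerRunner.crawl
    # returns a Deferred that fires when that crawl finishes.
    return runner.crawl(spider_cls, urls=urls)


def fetch_all_html(url_batches, spider_cls):
    runner = CrawlerRunner(get_project_settings())
    # Launch one spider per batch and gather the Deferreds so all crawls
    # run concurrently under the same Twisted reactor.
    deferreds = [run_spider(runner, spider_cls, batch) for batch in url_batches]
    dl = defer.DeferredList(deferreds)
    dl.addBoth(lambda _: reactor.stop())
    reactor.run()
```

Using a DeferredList lets all the crawls share one reactor, so the batches are fetched concurrently without spawning separate processes.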