Closed: math4humanities closed this 3 months ago
Interesting. Have you tried doing a longer run on your dev machine with this approach? I.e., have you done a full run of newscatcher or wayback machine locally with this branch?
Also: where do you set the maximum number of spiders to use?
Yes, but to achieve a smooth run we should dramatically increase the batch size. Currently the number of spiders used depends on the batch size, but I can easily rewrite it so that, conversely, the batch size depends on a set maximum number of spiders. The current batch size is arbitrary and should probably be more reflective of our expected workload.
I've rewritten the function to take a set number of spiders as input, with the default set to 4. This is a more uniform approach, and performance hasn't been significantly affected. I've completed a long run, and it works as expected.
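For illustration, here is a minimal sketch of what deriving the batch size from a spider cap could look like. `make_batches` and its signature are assumptions for the example, not the actual code in fetcher.py:

```python
import math

def make_batches(urls, max_spiders=4):
    """Split urls into at most max_spiders batches of roughly equal size.

    Hypothetical helper: the batch size is derived from the spider cap,
    instead of the spider count being derived from a fixed batch size.
    """
    if not urls:
        return []
    batch_size = math.ceil(len(urls) / max_spiders)
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```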
This is a large change, so I want to try it out on my dev machine before releasing, and supervise its first run on prod to make sure it performs well. I'll revisit this next week.
Split URLs into batches and fetch with multiple spiders concurrently using a Twisted DeferredList
Added supporting function run_spider and modified fetch_all_html in fetcher.py. Note: I also updated pyproject and the pre-commit config to reflect these changes.
Fetching time should be cut by 40-60%.
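A rough sketch of the approach, assuming a Scrapy `CrawlerRunner` and a spider class passed in by the caller; `run_spider` and `fetch_all_html` here are simplified stand-ins for the real functions in fetcher.py, and the `urls` keyword is an assumed spider argument:

```python
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import defer, reactor


def run_spider(runner, spider_cls, urls):
    # Schedule one crawl for a single batch of URLs; CrawlerRunner.crawl
    # returns a Deferred that fires when that crawl finishes.
    return runner.crawl(spider_cls, urls=urls)


def fetch_all_html(url_batches, spider_cls):
    runner = CrawlerRunner(get_project_settings())
    # Launch one spider per batch and gather the Deferreds so all crawls
    # run concurrently under the same Twisted reactor.
    deferreds = [run_spider(runner, spider_cls, batch) for batch in url_batches]
    dl = defer.DeferredList(deferreds)
    dl.addBoth(lambda _: reactor.stop())
    reactor.run()
```

Using a DeferredList lets all the crawls share one reactor, so the batches are fetched concurrently without spawning separate processes.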