Parallelize scraping code

emigre459 commented 4 days ago

Something like Ray should be a viable option for running multiple Selenium calls at once. For our purposes (scraping locations is a separate task from scraping location-ID-specific data), it would be good to have a simple parallelizing wrapper class. It would take a data source to be split across multiple processes, plus indices that say how to split it (e.g. IDs 0:999 in process_1, 1000:1999 in process_2, etc.).

[x] Build wrapper class
[x] Test on main page scraping (as this is already merged to main)
[x] Determine how many parallel processes we can run without bandwidth/refusal issues for a curated set of location IDs we know will be real
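
For illustration, the wrapper idea described above could be sketched roughly like this. This is only a minimal sketch, not the code that was merged: `split_indices`, `ParallelScraper`, and `scrape_chunk` are hypothetical names, and the standard-library `ThreadPoolExecutor` is swapped in for Ray so the example runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def split_indices(n_items: int, n_workers: int) -> List[Tuple[int, int]]:
    """Return contiguous half-open (start, stop) ranges covering n_items."""
    if n_items < 0 or n_workers < 1:
        raise ValueError("need n_items >= 0 and n_workers >= 1")
    base, extra = divmod(n_items, n_workers)
    bounds = []
    start = 0
    for i in range(n_workers):
        # The first `extra` workers each take one additional item.
        stop = start + base + (1 if i < extra else 0)
        if stop > start:  # skip empty ranges when n_workers > n_items
            bounds.append((start, stop))
        start = stop
    return bounds


class ParallelScraper:
    """Hypothetical wrapper: runs scrape_chunk on each slice of the data."""

    def __init__(self, scrape_chunk: Callable[[Sequence], list], n_workers: int):
        self.scrape_chunk = scrape_chunk  # assumed to return a list of results
        self.n_workers = n_workers

    def run(self, data: Sequence) -> list:
        bounds = split_indices(len(data), self.n_workers)
        chunks = [data[start:stop] for start, stop in bounds]
        with ThreadPoolExecutor(max_workers=self.n_workers) as pool:
            # map() preserves chunk order, so results line up with the input.
            per_chunk = list(pool.map(self.scrape_chunk, chunks))
        # Flatten the per-chunk result lists into one list.
        return [row for chunk_rows in per_chunk for row in chunk_rows]
```

Note that the ranges here are half-open Python slices, so 2000 IDs over two workers come out as 0:1000 and 1000:2000. With Ray, the `pool.map` call would presumably become a list of remote task calls, but the splitting logic should carry over unchanged.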

emigre459 commented 3 days ago

This may be helpful for the async approach, though I can't figure it out myself; it seems like they may have skipped some steps...

emigre459 commented 2 days ago

Not sure what the best number of simultaneous scrapers is, so I arbitrarily capped it at the number of CPUs on the host machine minus one, similar to what was done in the VESPID project's pipeline parallelization code (but that code was processor-limited, not I/O-limited like this...).

That said, this cap is probably not the worst limiting factor, and it should still speed things up by roughly 10x, based on the vCPU count.
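
A minimal sketch of that "CPUs minus one" cap, for reference. The function name `default_worker_count` is hypothetical and this is not necessarily how the merged code computes it; the only real API used is `os.cpu_count()`, which can return `None` on some platforms.

```python
import os
from typing import Optional


def default_worker_count(cpu_count: Optional[int] = None) -> int:
    """Return CPUs minus one, but never fewer than one worker."""
    if cpu_count is None:
        # os.cpu_count() may return None, so fall back to a single CPU.
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - 1)
```

Since the scraping is I/O-limited rather than processor-limited, a higher cap might also work; the CPU count is just a convenient starting point.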

emigre459 commented 1 day ago

Interesting to note: of my three parallel scraping windows (all scraping the exact same location), two would usually take a long time to load before proceeding. Still not sure what causes that...

emigre459 / evlens

Parallelize scraping code #13