Open emigre459 opened 4 days ago
This may be helpful for the async side of things, but I can't figure it out myself; it seems like they may have skipped some steps.
I'm not sure what the best number of simultaneous scrapers is, so I arbitrarily capped it at the number of CPUs on the host machine minus one,
similar to what was done in the VESPID project pipeline parallelization code (though that code was processor-limited, not I/O-limited like this).
That cap is probably not the worst limiting factor, though, and it should still speed things up roughly 10x, depending on the number of vCPUs.
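For reference, a minimal sketch of that cap using only the standard library. The function name `default_worker_count` and its `reserve` parameter are made up for illustration; they are not taken from the VESPID code. Since scraping is I/O-limited, the right number may well be higher than this, so treat it as a starting point to tune.

```python
import os


def default_worker_count(reserve: int = 1) -> int:
    """Return the number of CPUs minus `reserve`, but never less than 1.

    `reserve` is a hypothetical knob: how many CPUs to leave free for
    the rest of the machine.
    """
    # os.cpu_count() can return None when the count cannot be determined.
    cpus = os.cpu_count() or 1
    return max(1, cpus - reserve)


if __name__ == "__main__":
    print(f"Would launch {default_worker_count()} parallel scrapers")
```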
Interesting to note that usually among my three parallel scraping windows (scraping the exact same location) there were two that would take a long time to load up before proceeding. Still not sure what causes that...
Something like ray should be a viable option for running multiple Selenium calls at once. For the purposes of our tasks (wherein scraping locations is a separate task from scraping location-ID-specific data), it would be good to have a simple parallelizing wrapper class that takes a data source to be split across multiple processes, plus indices that indicate how to split it (e.g. IDs 0:999 in process_1, 1000:1999 in process_2, etc.).
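To make the wrapper idea concrete, here is a rough sketch. It deliberately swaps ray for a standard-library `concurrent.futures` thread pool so it runs without extra dependencies; the same splitting logic should carry over to ray tasks or separate processes. The names `split_indices`, `ParallelScraper`, and `scrape_chunk` are all hypothetical, and `scrape_chunk` stands in for whatever Selenium logic handles one slice of the data.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple


def split_indices(n_items: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split range(n_items) into at most n_chunks half-open (start, stop) ranges."""
    if n_items <= 0:
        return []
    n_chunks = max(1, min(n_chunks, n_items))
    base, extra = divmod(n_items, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one leftover item.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


class ParallelScraper:
    """Run a chunk-level scraping function over slices of a data source."""

    def __init__(
        self,
        scrape_chunk: Callable[[Sequence], list],
        max_workers: Optional[int] = None,
    ):
        self.scrape_chunk = scrape_chunk
        # Assumed default: number of CPUs minus one, as described in this issue.
        self.max_workers = max_workers or max(1, (os.cpu_count() or 1) - 1)

    def _run_range(self, items: Sequence, bounds: Tuple[int, int]) -> list:
        start, stop = bounds
        return self.scrape_chunk(items[start:stop])

    def run(self, items: Sequence) -> list:
        ranges = split_indices(len(items), self.max_workers)
        if not ranges:
            return []
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            # map() yields chunk results in the same order as `ranges`.
            chunk_results = pool.map(lambda b: self._run_range(items, b), ranges)
            return [row for chunk in chunk_results for row in chunk]


if __name__ == "__main__":
    # Placeholder for real scraping: just tag each ID.
    scraper = ParallelScraper(lambda ids: [f"scraped-{i}" for i in ids], max_workers=2)
    print(split_indices(2000, 2))
    print(len(scraper.run(list(range(2000)))))
```

Note that the ranges are half-open, so `(0, 1000)` covers IDs 0 through 999. Threads seemed like a reasonable fit here because each Selenium call mostly waits on the browser, but that is an assumption worth testing against a process-based or ray-based version.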