#51 - scrape in parallel

djay / covidthailand

Thailand Covid testing and case data gathered and combined from various sources for others to download or view


#51 - scrape in parallel #103

Closed reduxionist closed 2 years ago

reduxionist commented 2 years ago

don't worry about the logging stuff, that can be optimized after we determine if the MP is worth it...

reduxionist commented 2 years ago

So already one clear benefit from switching to logging: timestamps in the log entries show us how long we've been stuck on a particular step. There are plenty of cases where we're repeating the date value twice, but I thought it best to keep the output string identical (barring the switch to lazy evaluation of params/f-string constructs, since that could be another source of potential speedup once we take advantage of setting log levels appropriately). The third-party logging library I used is MP compatible OOTB, but for use with print progress indicators (the way we output "."'s on progress), their instructions for TQDM progress indicators are more promising than the ones for print progress indicators that I followed for now. I believe the simplest approach would be to group log output into different files, leaving only progress indicators (and their section headings) on stdout. (That or fix my first-timer's use of the .bind.raw method chain on loguru's logger... :wink:)
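A minimal sketch of why lazy evaluation of log parameters can save work once log levels are set. It uses the standard-library `logging` module rather than loguru (which has a comparable `logger.opt(lazy=True)` option), and the `ExpensiveRepr` class and logger name are hypothetical, purely for illustration:

```python
import logging


class ExpensiveRepr:
    """Hypothetical value that counts how often it gets formatted."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "expensive value"


logger = logging.getLogger("lazy_demo")
logger.setLevel(logging.INFO)           # DEBUG messages are disabled
logger.addHandler(logging.NullHandler())
logger.propagate = False

value = ExpensiveRepr()

# Eager: the f-string is built before logging checks the level,
# so the formatting cost is paid even though nothing is emitted.
logger.debug(f"row: {value}")
eager_calls = value.calls

# Lazy: the argument is only formatted if the record is actually emitted,
# so a disabled DEBUG call costs almost nothing.
logger.debug("row: %s", value)
lazy_calls = value.calls - eager_calls

print(eager_calls, lazy_calls)
```

The eager call formats the value once even though DEBUG is off, while the lazy call never touches it.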

reduxionist commented 2 years ago

it's gotten better but i still got a deadlock when i went back far enough (100 days).

reduxionist commented 2 years ago

Fantastic that it's passing the build. I suspect the low concurrency count is actually helping us avoid the deadlock. I do notice some log formatting errors (e.g. WARNING headers that aren't followed by the appropriate line of warning data), and a lot of formatting optimisation that could be done. But now that everything is a logging call, we should be able to prioritize outputs into appropriate DEBUG, WARN, ERROR levels, and I highly recommend we dump all of the DEBUG, NOTICE, INFO messages (at least) to files while restricting stderr output to real problems only at first. This way we can still see what happened after a run, but IME, printing output is much slower than dumping to file on chatty logs like these in Python...
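A rough sketch of the split suggested above: everything goes to a file, and only warnings and above reach the console. It assumes the standard-library `logging` module instead of the loguru setup used in the PR; the logger name, file name, and messages are made up for illustration, and an in-memory stream stands in for stderr so the result can be inspected:

```python
import io
import logging
import os
import tempfile


def build_logger(log_path, console_stream):
    """Send all records to a file, but only WARNING and above to the console."""
    logger = logging.getLogger("scrape_demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )

    console_handler = logging.StreamHandler(console_stream)
    console_handler.setLevel(logging.WARNING)
    console_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "scrape.log")
console = io.StringIO()  # stands in for sys.stderr in this sketch

logger = build_logger(log_path, console)
logger.debug("fetched %s", "page 1")
logger.info("parsed %d rows", 42)
logger.warning("missing column %s", "Tests")

for handler in logger.handlers:
    handler.flush()

with open(log_path, encoding="utf-8") as f:
    file_lines = f.read().splitlines()
console_lines = console.getvalue().splitlines()

print(len(file_lines), console_lines)
```

Note that the standard library does not support several processes writing to one log file through `FileHandler`, so a multiprocessing run would need one file per worker or a queue-based handler.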

djay commented 2 years ago

@reduxionist does the dashboard stuff take longer than everything else? If so then it can be broken up into separate tasks. Each could import the CSV they are adding to separately, and the results could then be combined using combine_first and exported after everything is done. I'm pretty sure they are adding different data, so it should result in the same dataframe in the end.
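A small sketch of the `combine_first` idea, assuming each task returns a date-indexed pandas DataFrame that fills different cells. The column names, dates, and values are invented for illustration and are not taken from the project's data:

```python
import pandas as pd

dates = pd.to_datetime(["2021-10-01", "2021-10-02", "2021-10-03"])

# Hypothetical previously exported data with a gap on the last day.
existing = pd.DataFrame({"Cases": [100.0, 110.0, None]}, index=dates)

# Hypothetical outputs of two independent scraping tasks.
task_a = pd.DataFrame({"Cases": [120.0]}, index=dates[2:])
task_b = pd.DataFrame({"Deaths": [1.0, 2.0, 3.0]}, index=dates)


def merge_parts(base, parts):
    """Fill gaps in `base` from each part; existing values take priority."""
    combined = base
    for part in parts:
        combined = combined.combine_first(part)
    return combined


combined = merge_parts(existing, [task_a, task_b])
reordered = merge_parts(existing, [task_b, task_a])

print(combined)
```

Because the two tasks fill different cells, merging them in either order gives the same values here. If two tasks ever wrote conflicting values to the same cell, the order would matter, since `combine_first` keeps the caller's non-null value.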

reduxionist commented 2 years ago

On Wed, Oct 6, 2021 at 10:55 Dylan Jay @.***> wrote:

> @reduxionist https://github.com/reduxionist does the dashboard stuff take longer than everything else?

Yes, it does.

> if so then it can be broken up into separate tasks. each could import the csv they are adding to separately and then combined using combine_first and exported after everything is done. I'm pretty sure they are adding different data so should result in the same dataframe in the end.

I will have a try at implementing this then!

djay commented 2 years ago

@reduxionist so can merge right?

reduxionist commented 2 years ago

It LGTM so I'd say yes...

