CADWRDeltaModeling / dms_datastore

Data download and management tools for continuous data for Pandas. See documentation https://cadwrdeltamodeling.github.io/dms_datastore/
MIT License

multiprocessing is not thread/processor safe with regard to list of failures and skips #32

Open water-e opened 6 months ago

water-e commented 6 months ago

I think to make this work the main routines will have to get rid of those as list inputs (which are then appended to as a side effect). No need for an argument, but they will have to be output on completion of the job and appended to a master list. Probably wouldn't hurt to sort the master lists so they are predictably arranged.

dwr-psandhu commented 6 months ago

Agreed, the methods will return failures and skips, which can then be appended to the master list after returning from the futures. I can sort the master list, but what should I use as the sort key?
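A minimal sketch of that return-and-merge pattern, assuming a worker that builds its own local lists. `download_station`, `run_all`, and the station names are hypothetical and not taken from dms_datastore; only the `concurrent.futures` calls are real.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_station(station):
    """Hypothetical worker: returns its own failures and skips instead of
    appending to lists passed in by the caller."""
    failures, skips = [], []
    if station.startswith("bad"):
        failures.append(station)   # pretend the download failed
    elif station.startswith("old"):
        skips.append(station)      # pretend the file was already current
    return failures, skips


def run_all(stations, max_workers=4):
    """Submit one job per station and merge the returned lists."""
    master_failures, master_skips = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(download_station, s) for s in stations]
        for future in as_completed(futures):
            failures, skips = future.result()
            # Only the calling thread touches the master lists, so the
            # workers never share mutable state.
            master_failures.extend(failures)
            master_skips.extend(skips)
    return master_failures, master_skips
```

Because the merge happens in the caller, the same worker should also work unchanged if the pool is later swapped for a `ProcessPoolExecutor`, where side-effect appends in the worker would be lost entirely.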

water-e commented 6 months ago

Alphanumeric? My only concern is predictability: the order of the master list would depend on which parallel call returns first, which might ultimately affect any diffing we might do. I can't imagine that as_completed yields futures in any order except the order they complete, and that is going to be I/O or latency dependent. An unpredictable return order would amount to shuffling the master list.
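To illustrate the concern, here is a small sketch in which completion order is deliberately randomized. `fake_download`, `collect_failures`, and the `(agency, station)` tuples are made up for the example; a plain `sorted()` on such tuples is assumed to be an acceptable "alphanumeric" key.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(item):
    """Hypothetical worker that reports every item as a failure
    after a random delay standing in for network latency."""
    time.sleep(random.uniform(0.0, 0.02))
    return [item]


def collect_failures(items, max_workers=4):
    """Merge per-job failure lists in whatever order the jobs finish."""
    merged = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_download, it) for it in items]
        for future in as_completed(futures):  # completion order, not submit order
            merged.extend(future.result())
    return merged


# Hypothetical (agency, station) identifiers.
items = [("agency_b", "station_2"), ("agency_a", "station_9"), ("agency_a", "station_1")]

unsorted_run = collect_failures(items)   # order may differ from run to run
stable_run = sorted(unsorted_run)        # tuples sort by agency, then station
```

An alternative that avoids sorting is to loop over the `futures` list itself and call `result()` on each in submission order, at the cost of not handling results as soon as they arrive.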

water-e commented 5 months ago

Actually now that I look at this it seems like this is potentially a complex issue.

It appears we are using ThreadPoolExecutor rather than ProcessPoolExecutor. I believe that means we are using the equivalent of multithreading rather than multiprocessing. If so, logging would be safe.

Using ThreadPoolExecutor is probably OK for the downloaders, since they are the classic web-IO-bound example. Using it for reformat is more ambiguous, since my understanding is that threads don't always help for file-I/O-bound work. It would be interesting to test Thread vs Process. That should probably be done in a local disk environment, because it seems (another issue) that disk locality is brutally important for reformat. For that, "the only way to win is not to play," which would mean not reformatting old files. Since the formatting does change, we would still have to do it from time to time.

I'll experiment a bit. I think the real limiters need to be studied in a disciplined way.
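A rough starting point for the Thread vs Process comparison might be a harness like the one below. `time_executor` and `fake_reformat` are hypothetical names, and the CPU-bound loop is only a stand-in for the real reformat step, so actual timings would need the real workload on local disk.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_executor(executor_cls, func, items, max_workers=4):
    """Run func over items with the given executor class.

    Returns (elapsed_seconds, results); results keep the order of items
    because executor.map preserves input order.
    """
    start = time.perf_counter()
    with executor_cls(max_workers=max_workers) as executor:
        results = list(executor.map(func, items))
    return time.perf_counter() - start, results


def fake_reformat(n):
    """Hypothetical stand-in for reformatting one file (pure CPU work)."""
    return sum(i * i for i in range(n))
```

Calling `time_executor(ThreadPoolExecutor, fake_reformat, sizes)` and then the same call with `ProcessPoolExecutor` gives a like-for-like timing. For the process version the worker has to be a picklable module-level function, and on platforms that spawn new interpreters the call should sit under an `if __name__ == "__main__":` guard.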