broadinstitute / imaging-backup-scripts

Scripts to backup data for the Imaging Platform
MIT License

Add requests parallelization and status validation #18

Closed johnarevalo closed 3 years ago

johnarevalo commented 3 years ago

I ran this version to restore 17k files with 1 and with 8 workers. 1 worker took ~2h while 8 workers took ~26min.

Probably 4 workers is the limit for this parallelization strategy.
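
For readers following along, the worker-pool idea can be sketched roughly as below. This is not the code from the PR: `restore_all` and the `request_restore` callable are hypothetical names, and the callable is assumed to issue one restore request per key and return a status string.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def restore_all(keys, request_restore, max_workers=8):
    """Call request_restore(key) for every key using a pool of threads.

    request_restore is a hypothetical callable (not from the PR) that sends
    one restore request and returns a status string such as "REQUESTED".
    Any exception it raises is recorded as "ERROR" for that key.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the key it was submitted for.
        futures = {pool.submit(request_restore, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception:
                results[key] = "ERROR"
    return results
```

Threads are a reasonable fit here because each request mostly waits on the network, so `max_workers` controls how many requests are in flight at once.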

shntnu commented 3 years ago

Thanks @johnarevalo! Is this ready for review? If so, I think @bethac07 would be the right reviewer.

bethac07 commented 3 years ago

Awesome, thanks for your help with this!

Just an FMI: if you ran it with 8, why do you think 4 is the max (and why is 8 the default in the script)? Was it just running into a lot of Slow Down errors?

My concern with having output written only to a log file is that people might never actually check it, especially if the request is thousands or 10s of thousands of files. Can we figure out a way to summarize or print some sort of console report at the end?

johnarevalo commented 3 years ago

I have experienced limitations for parallel requests on other AWS APIs before. After a certain number (e.g. 4 or 8 concurrent calls) there are no further time savings.

I ran a couple more restorations with 4 and 6 workers for the same number of objects. They took ~35min and ~26min respectively. My guess is this: the limit is 4 concurrent calls, and the script requires an additional thread to coordinate. So setting it to 8 is a conservative default IMO.
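
To make the comparison easier to read, here is a small sketch that turns the timings reported in this thread into speedup factors. The numbers are the approximate values quoted above, with "~2h" taken as 120 minutes, so the results are rough.

```python
# Approximate wall-clock times in minutes, as reported in this thread.
# "~2h" for 1 worker is assumed to mean about 120 minutes.
timings_min = {1: 120, 4: 35, 6: 26, 8: 26}


def speedups(timings):
    """Return the speedup of each worker count relative to 1 worker."""
    baseline = timings[1]
    return {workers: baseline / t for workers, t in sorted(timings.items())}


for workers, factor in speedups(timings_min).items():
    print(f"{workers} workers: {factor:.1f}x")
```

With these inputs the speedup is about 3.4x at 4 workers and about 4.6x at both 6 and 8 workers, so the reported times stop improving somewhere between 6 and 8.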

About summarizing, we could print counts per status, something like:

REQUESTED      17258
IN_PROGRESS       19
ERROR              1
RESTORED           0

For more info check path/to/log/output.csv
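
A report like the one above could be produced along these lines. This is only a sketch: `format_summary` and `STATUSES` are hypothetical names, and it assumes the script already has one status string per requested object.

```python
from collections import Counter

# Status names taken from the example report; the order controls row order.
STATUSES = ("REQUESTED", "IN_PROGRESS", "ERROR", "RESTORED")


def format_summary(statuses, log_path):
    """Build a console report with one count per status plus a log pointer."""
    counts = Counter(statuses)
    # Left-align the name in 12 columns, right-align the count in 8.
    lines = [f"{name:<12}{counts[name]:>8}" for name in STATUSES]
    lines.append("")
    lines.append(f"For more info check {log_path}")
    return "\n".join(lines)


print(format_summary(["REQUESTED", "REQUESTED", "ERROR"], "path/to/log/output.csv"))
```

Statuses with no occurrences still get a row with a count of 0, which keeps the report the same shape on every run.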

bethac07 commented 3 years ago

Yeah, I've run into other throttling stuff as well, which is why I initially hadn't dug too deeply into parallelization, because I wanted to make sure to stay under the throttling limit.
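
One common way to stay under a throttling limit is to retry with exponential backoff and jitter. The sketch below is an illustration, not something from this repository: `ThrottledError` is a hypothetical stand-in for whatever exception the client raises on a Slow Down response, and the default delays are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling response such as Slow Down."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on ThrottledError with exponential backoff.

    The sleep function is injectable so the behaviour can be tested
    without actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller see the error.
                raise
            # Full jitter: wait a random time up to base_delay * 2**attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Each worker could wrap its request in `call_with_backoff`, so a burst of throttling slows the pool down instead of producing a wall of errors.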

I think that's a great way of summarizing!

shntnu commented 3 years ago

I have experienced limitations for parallel requests on other AWS APIs before. After a certain number (e.g. 4 or 8 concurrent calls) there are no further time savings.

That's good to know!

johnarevalo commented 3 years ago

@bethac07, the last push includes the suggested changes.