RobertLuptonTheGood / eups

A version manager tracking product dependencies

Improvement: Speeding up EUPS #138

Open mwittgen opened 3 years ago

mwittgen commented 3 years ago

`eups distrib install` calls run very slowly because each invocation caches hundreds of temp files via sequential HTTP GETs. This is a suggestion to run `getTaggedProductInfo` in a thread pool, which speeds up the install significantly while only changing a few lines in `distrib/server.py`. `listDir` also retrieves the same data from a handful of web directories multiple times; caching the results in a dictionary reduces this to one HTTP GET per web directory. https://github.com/RobertLuptonTheGood/eups/compare/master...mwittgen:master
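A minimal standalone sketch of the two changes described above, using only the standard library. `fetch_product_info` and `list_dir` are simplified stand-ins for eups's `getTaggedProductInfo` and `listDir`, not the actual eups internals; products are represented as `(name, url)` pairs for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

_listdir_cache = {}  # one HTTP GET per web directory, reused afterwards

def list_dir(url):
    """Fetch a directory listing once and serve later calls from the cache."""
    if url not in _listdir_cache:
        with urlopen(url) as resp:
            _listdir_cache[url] = resp.read().decode()
    return _listdir_cache[url]

def fetch_product_info(product):
    """Placeholder for a per-product metadata fetch (one HTTP GET each)."""
    name, url = product  # product as a (name, url) pair in this sketch
    with urlopen(url) as resp:
        return name, resp.read()

def fetch_all(products, max_workers=8):
    """Issue the per-product GETs concurrently instead of sequentially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch_product_info, products))
```

The dictionary cache is not locked; in the worst case two threads fetch the same directory once each, which is still far better than one GET per call.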

More optimizations/suggestions (a rough sketch of the streaming idea follows below):

- Make the number of pool threads configurable from the eups command line.
- Allow setting the chunk size for URL reads from the command line, or replace the reads with urllib3 streaming.
- Run non-recursive installations (product + dependencies) in a thread pool.
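A rough illustration of the chunked-streaming suggestion, assuming urllib3 is available. The `chunk_size` parameter is what a hypothetical command-line flag would set; no such eups option exists today:

```python
import urllib3

def download(url, dest, chunk_size=64 * 1024):
    """Stream a URL to disk in configurable chunks instead of one big read()."""
    http = urllib3.PoolManager()
    resp = http.request("GET", url, preload_content=False)
    with open(dest, "wb") as f:
        for chunk in resp.stream(chunk_size):
            f.write(chunk)
    resp.release_conn()
```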

timj commented 3 years ago

Sounds interesting. Can you make a pull request please? (and preferably a Jira ticket branch)

RobertLuptonTheGood commented 3 years ago

I think it's done this way because of a robots.txt file that NCSA had (still has?), which meant that eups couldn't pull down all the files in one go. It'd probably be worth checking whether this is still the case.

ktlim commented 3 years ago

As far as I know, robots.txt can only allow or disallow access, not rate-limit. It is also interpreted by the client, not the server. Finally, the Rubin Observatory eups package repository is no longer hosted at NCSA.
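A quick way to see that robots.txt is purely a client-side, allow/disallow mechanism is the standard library's parser; the server never enforces anything (the URL below is just an example host, and the user-agent string is made up):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()  # the *client* fetches the rules and chooses to honor them
print(rp.can_fetch("eups-distrib", "https://example.org/stack/src/"))
```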

RobertLuptonTheGood commented 3 years ago

The configuration meant that I couldn't pull down the entire directory in one transaction. If that is no longer the case, I think that this change might well solve the reported problem.

mwittgen commented 3 years ago

I don't think the HTTP protocol supports fetching multiple files in one transaction unless there is server-side support to bundle such a request into a tar/zip message body, which leaves only the option of issuing multiple GETs in parallel. Since many small files are requested at the beginning of each eups run, the sequential approach is inefficient. yum, for example, bundles all of its repo metadata into larger zip files (see the sketch below); the downside is that the metadata files need to be regenerated whenever the repo content changes.
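A sketch of that bundled-metadata alternative: one GET for a single archive replaces hundreds of small GETs. The archive name `metadata.zip` is hypothetical; eups does not currently publish such a bundle:

```python
import io
import zipfile
from urllib.request import urlopen

def fetch_metadata_bundle(base_url):
    """Download one zip of all repo metadata and unpack it in memory."""
    with urlopen(base_url + "/metadata.zip") as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    # Each member corresponds to a file that would otherwise be its own GET.
    return {name: archive.read(name) for name in archive.namelist()}
```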