galaxyproject / ephemeris

Library for managing Galaxy plugins - tools, index data, and workflows.
https://ephemeris.readthedocs.org/

Proper and fast check if repos are installed. #141

Closed rhpvorderman closed 4 years ago

rhpvorderman commented 5 years ago

Works around #140. @erasche, can you run this on the EU Galaxy tool lists and report your findings? (Maybe make an image of the server first... All the testing succeeds, but you never know.)

rhpvorderman commented 5 years ago

I also tackled the very inefficient search mechanism. It now builds a set of installed repos and compares against that set, instead of looping over the list of installed tools. This should give some speedup on a server with thousands of tools installed.
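For readers following along, here is a minimal sketch of the set-based check described above. It is not the code from this PR: the function names and the dictionary fields (`name`, `owner`, `changeset_revision`) are assumptions chosen for illustration. It only shows why a set lookup avoids rescanning the installed list for every requested repo.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical identity of a repository; the real fields may differ.
RepoKey = Tuple[str, str, str]


def repo_key(repo: Dict[str, str]) -> RepoKey:
    """Build a hashable key from the (assumed) identifying fields."""
    return (repo["name"], repo["owner"], repo["changeset_revision"])


def split_by_installed(
    requested: Iterable[Dict[str, str]],
    installed: Iterable[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (to_install, already_installed).

    The installed repos are hashed once, so each membership test is
    O(1) on average instead of a scan over the whole installed list.
    """
    installed_keys: Set[RepoKey] = {repo_key(r) for r in installed}
    to_install: List[Dict[str, str]] = []
    already_installed: List[Dict[str, str]] = []
    for repo in requested:
        if repo_key(repo) in installed_keys:
            already_installed.append(repo)
        else:
            to_install.append(repo)
    return to_install, already_installed
```

With N requested and M installed repos this does roughly N + M operations rather than N * M, which only matters if the comparison, and not the API, is the slow part.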

hexylena commented 5 years ago

Absolutely, I'll try to do that this evening! Thanks so much for tackling this

hexylena commented 5 years ago

We live dangerously so it's running in prod. https://build.galaxyproject.eu/job/usegalaxy-eu/job/install-tools/117/console Hopefully it works!

rhpvorderman commented 5 years ago

It still has to call the Galaxy API quite a lot. If it is slow I can think of another way to speed it up, but that will require some more work. I hope this is sufficient.

rhpvorderman commented 5 years ago

Hmm, still running after an hour... It's a bit annoying that it is not more verbose.

hexylena commented 5 years ago

This seems to have taken even longer, but not sure if that's something on our end? We can keep trying. The build took 12 hours and was killed(ish) at midnight by an automated process.

rhpvorderman commented 5 years ago

@erasche. There was no way to check if this workaround was faster other than by putting it to a real production test. I guess the API call for each of the installed tools takes way too much time. This is the only way to determine if Galaxy is actually going to skip the tool, but this check is probably just as slow as actually trying to install a new tool. I am afraid it is not really possible to work around the slowness of the API in that case.

I removed the API check. Using sets to determine whether a tool is already installed should still speed up the process a bit, given the large number of tools on the Galaxy server.

hexylena commented 5 years ago

> There was no way to check if this workaround was faster other than by putting it to a real production test.

Oh, all good. We're happy to be that test :)

rhpvorderman commented 5 years ago

@erasche. I think this problem is not solvable unless the Galaxy API is fixed. Deducing whether Galaxy is going to install a tool or not requires an API call for each tool on the Galaxy server, which is extremely slow. Alternatively, the list of all available repos on the toolshed can be downloaded to make a list of installable revisions. If a Galaxy tool is not an installable revision, the API call method can be used. This might save some time, but I doubt it. It also introduces the overhead of having to download the repo lists from each toolshed (~10 MB of data) for each installed YAML, which is not worth it.

Nevertheless, comparing using sets should save some time on the install. So that change can be kept from this PR.

rhpvorderman commented 5 years ago

@mvdbeek, can you have a look at this PR? It does a set comparison to check if stuff is already installed. As such it should speed up the install scripts on big Galaxy servers with many tools installed.

mvdbeek commented 5 years ago

Well, this doesn't optimise the bottleneck and introduces more code. If we do want to optimise the code (I don't think we should) I'd just build sets of tuples up front and use the asymmetric difference.
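A rough sketch of what "sets of tuples up front" plus a set difference could look like is below. This is an illustration only, not code from ephemeris: the function name and the fields (`name`, `owner`, `tool_shed_url`, `changeset_revision`) are assumptions. The difference is asymmetric in the sense that it keeps what is requested but not installed, and ignores what is installed but not requested.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical tuple layout identifying one repository revision.
RepoTuple = Tuple[str, str, str, str]


def as_tuples(repos: Iterable[Dict[str, str]]) -> Set[RepoTuple]:
    """Convert repository dicts to hashable tuples once, up front."""
    return {
        (r["name"], r["owner"], r["tool_shed_url"], r["changeset_revision"])
        for r in repos
    }


def repos_to_install(
    requested: Iterable[Dict[str, str]],
    installed: Iterable[Dict[str, str]],
) -> Set[RepoTuple]:
    """Requested repos that are not installed yet (requested - installed)."""
    return as_tuples(requested) - as_tuples(installed)
```

Note that the result is a set, so the order of the original tool list is lost; sort it or filter the original list against it if install order matters.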

rhpvorderman commented 4 years ago

Closing this, as it does not address the bottleneck, as @mvdbeek stated.