galaxyproject / ephemeris

Library for managing Galaxy plugins - tools, index data, and workflows.
https://ephemeris.readthedocs.org/

Run data managers aggressive parallelization and refactoring. #79

Closed rhpvorderman closed 6 years ago

rhpvorderman commented 6 years ago

While installing a few reference genomes on my Galaxy instance, I got annoyed by the indexing steps. These take quite a long time, and run-data-managers only runs one data manager at a time. I feel that job scheduling should be handled by Galaxy, not by run-data-managers, so I changed the way run-data-managers submits jobs.

run-data-managers now first picks all the data managers that populate source tables (default: ["all_fasta"]), since other data managers depend on these tables, and runs them. After that it runs all the other data managers and lets Galaxy figure out how to schedule the jobs. This provides a significant speedup when you're adding a vertebrate genome to the list: instead of watching your bowtie and bwa indexes be created one after another, they are now created simultaneously.
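The two-phase ordering described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `partition_data_managers` helper, the `data_table_names` key on each entry, and the example IDs and table names are all hypothetical. Only the default source table `all_fasta` comes from the description.

```python
from typing import Dict, List, Sequence, Tuple

# Default taken from the PR description; more source tables could be listed.
SOURCE_TABLES = ["all_fasta"]


def partition_data_managers(
    data_managers: Sequence[Dict],
    source_tables: Sequence[str] = SOURCE_TABLES,
) -> Tuple[List[Dict], List[Dict]]:
    """Split entries into (source-table populators, everything else).

    Assumes each entry lists the tables it fills under the hypothetical
    key 'data_table_names'.
    """
    fetchers, indexers = [], []
    for dm in data_managers:
        tables = dm.get("data_table_names", [])
        if any(table in source_tables for table in tables):
            fetchers.append(dm)  # must finish before anything else starts
        else:
            indexers.append(dm)  # can all be submitted together afterwards
    return fetchers, indexers


# Illustrative entries only; the IDs and table names are made up for the sketch.
example = [
    {"id": "bwa_index", "data_table_names": ["bwa_mem_indexes"]},
    {"id": "fetch_genome", "data_table_names": ["all_fasta"]},
    {"id": "bowtie2_index", "data_table_names": ["bowtie2_indexes"]},
]
fetchers, indexers = partition_data_managers(example)
phase_one_ids = [dm["id"] for dm in fetchers]
phase_two_ids = [dm["id"] for dm in indexers]
```

Here the genome fetcher lands in the first phase and both indexers land in the second, so the indexers only start once `all_fasta` has been populated.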

Internally I had to completely overhaul run-data-managers. It is now a DataManagers object that has a run method. This made a lot of communication between functions much easier, and the code is a bit cleaner. The DataManagers object can now also be used in other scripts.
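To give a feel for the shape of such an object, here is a rough sketch. It is an assumption-laden stand-in, not the actual DataManagers class: the name `DataManagersSketch`, its constructor arguments, and the `submit` / `wait_for_all` callables are hypothetical. The point is only that each phase submits every job up front and then waits, leaving the scheduling to Galaxy.

```python
from typing import Callable, List, Sequence


class DataManagersSketch:
    """Hypothetical stand-in for an object with a run method."""

    def __init__(
        self,
        fetchers: Sequence[str],
        indexers: Sequence[str],
        submit: Callable[[str], int],
        wait_for_all: Callable[[List[int]], None],
    ) -> None:
        self.fetchers = list(fetchers)
        self.indexers = list(indexers)
        self.submit = submit              # hands one job to Galaxy, returns a job id
        self.wait_for_all = wait_for_all  # blocks until the given jobs are done

    def run(self) -> List[List[int]]:
        """Run source-table jobs first, then everything else."""
        phases = []
        for batch in (self.fetchers, self.indexers):
            # Submit the whole batch without waiting in between ...
            job_ids = [self.submit(dm) for dm in batch]
            # ... then wait once for all of them.
            self.wait_for_all(job_ids)
            phases.append(job_ids)
        return phases


# Fake callables so the sketch runs without a Galaxy server.
submitted: List[str] = []
waited: List[List[int]] = []


def fake_submit(dm: str) -> int:
    submitted.append(dm)
    return len(submitted)  # pretend job id


def fake_wait(job_ids: List[int]) -> None:
    waited.append(list(job_ids))


managers = DataManagersSketch(
    ["fetch_genome"], ["bwa_index", "bowtie2_index"], fake_submit, fake_wait
)
phases = managers.run()
```

Because the submit and wait steps are passed in, another script could reuse the same object with its own way of talking to Galaxy.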

Since I had to do some testing, I overhauled the test scripts as well. These are now split into three parts. The shed-tools testing was quite slow, and I did not want to wait for it every time. There is now a separate script for testing run-data-managers, which made testing a bit easier.

bgruening commented 6 years ago

Sorry for being so late to the game. This is great, thanks a lot @rhpvorderman!