galaxyproteomics / tools-galaxyp

Galaxy Tool Shed repositories maintained and developed by the GalaxyP community
MIT License

EggNOG: Implement possibility of two stage mode #673

Closed bernt-matthias closed 1 year ago

bernt-matthias commented 1 year ago

As described here: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.8#Setting_up_large_annotation_jobs

I'm wondering whether I should split this into two tools (maybe also keeping the 'monolithic' one). The advantage would be that admins can set different destinations for the CPU-intensive search stage and the I/O-intensive annotation stage.

This PR also adds the cache mode and makes the tests more specific by using regexes instead of sim_size.
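
For illustration, here is a minimal sketch of how the two stages could be invoked separately. It only builds the command lines and does not run them. The flags (`--no_annot`, `-m no_search`, `--annotate_hits_table`) are the ones I understand the linked wiki page to describe for eggNOG-mapper v2.1.x; the function name, file names, and CPU counts are hypothetical and not taken from the wrapper in this PR.

```python
import shlex


def two_stage_commands(fasta, search_prefix, annot_prefix,
                       search_cpus=10, annot_cpus=2):
    """Return (search_cmd, annotate_cmd) as argument lists.

    Hypothetical helper: names and defaults are illustrative only.
    """
    # Stage 1: CPU-bound homology search; --no_annot skips the annotation step.
    search = [
        "emapper.py", "-m", "diamond", "--no_annot",
        "-i", fasta, "-o", search_prefix, "--cpu", str(search_cpus),
    ]
    # Stage 1 is assumed to write its hits to <prefix>.emapper.seed_orthologs.
    hits = f"{search_prefix}.emapper.seed_orthologs"
    # Stage 2: I/O-bound annotation of the precomputed hits, without searching.
    annotate = [
        "emapper.py", "-m", "no_search",
        "--annotate_hits_table", hits,
        "-o", annot_prefix, "--cpu", str(annot_cpus),
    ]
    return search, annotate


if __name__ == "__main__":
    for cmd in two_stage_commands("proteins.fasta", "sample", "sample_annot"):
        print(shlex.join(cmd))
```

With two separate tools, each of these commands could be routed to its own job destination, which is the point of the split.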

bernt-matthias commented 1 year ago

Ping @abretaud @jj-umn and @bgruening: what do you think about splitting the tool?

bernt-matthias commented 1 year ago

@bernt-matthias do you have performance problems?

I guess two tools make technical sense, but is this really a problem for our users? How much UX do we lose if we split it up? Hard-core users can now run this tool twice, correct? I guess that is enough to address both use cases.

One of my users has a FASTA file with 12,000,000 sequences. The search phase runs for several days (with good CPU usage on 10 cores), and the annotation phase runs for more than a week using only a few percent of the 10 cores (currently my max run time). It might be that the new --scratch_dir and --temp_dir parameters help. I'm happy to test that first.
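
As a rough sketch of what such a test could look like, the snippet below builds an annotation-only command with the two directory options added when they are set. It assumes, as I understand the eggNOG-mapper help text, that `--scratch_dir` writes output to a scratch location before moving it to the final one and that `--temp_dir` controls where temporary files are created; the helper name and the paths are hypothetical.

```python
def annotation_command(hits_table, prefix, scratch_dir=None, temp_dir=None):
    """Build an annotation-only emapper.py call (hypothetical helper)."""
    cmd = [
        "emapper.py", "-m", "no_search",
        "--annotate_hits_table", hits_table,
        "-o", prefix,
    ]
    # Both options are only appended when a directory is given, so the same
    # helper can be used to compare run times with and without them.
    if scratch_dir:
        cmd += ["--scratch_dir", scratch_dir]
    if temp_dir:
        cmd += ["--temp_dir", temp_dir]
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders for fast local storage on a compute node.
    print(" ".join(annotation_command(
        "sample.emapper.seed_orthologs", "sample_annot",
        scratch_dir="/local/scratch", temp_dir="/local/tmp",
    )))
```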