bernt-matthias closed 6 months ago
Hi Brent. GTDB-Tk has the check_install command to verify that all reference data and third-party dependencies are as expected. It also has the test command, which runs a set of included genomes through the classify workflow as an integration test. If you prefer to run CI differently, you can use the genomes processed by the test command, which are included in the GTDB-Tk reference data.
https://ecogenomics.github.io/GTDBTk/commands/check_install.html
https://ecogenomics.github.io/GTDBTk/commands/test.html
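For reference, the two commands mentioned above could be driven from a CI script roughly as follows. This is a minimal sketch, not an official recipe: the helper names gtdbtk_smoke_commands and run_smoke_checks are hypothetical, it assumes the gtdbtk executable is on PATH, and both commands still need the full reference data to be installed and configured.

```python
import shutil
import subprocess


def gtdbtk_smoke_commands(out_dir):
    """Return the two GTDB-Tk self-check command lines as argv lists.

    Hypothetical helper for illustration; out_dir is where the
    `test` command writes its results.
    """
    return [
        ["gtdbtk", "check_install"],
        ["gtdbtk", "test", "--out_dir", str(out_dir)],
    ]


def run_smoke_checks(out_dir):
    """Run both checks in order.

    Returns False without running anything if gtdbtk is not on PATH.
    Raises subprocess.CalledProcessError if a command exits non-zero.
    """
    if shutil.which("gtdbtk") is None:
        return False
    for argv in gtdbtk_smoke_commands(out_dir):
        subprocess.run(argv, check=True)
    return True
```

Building the argv lists separately from running them keeps the command lines inspectable on machines where GTDB-Tk is not installed.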
The processed genomes are not the problem. The problem is the reference data (i.e. GTDB). We are wondering whether there is a small database (or whether one can be constructed) that we can use in tests.
Our problem is that we have hundreds of tools in one of the main Galaxy tool repos (https://github.com/galaxyproject/tools-iuc/) and have to restrict ourselves to small test (and reference) data.
Hi Brent,
Makes sense, but unfortunately we don't have such a set of reference data. What might work for you is to run a single genome that belongs to a species in the GTDB reference database. This will result in only the ANI prescreen part of GTDB-Tk running, and thus avoids the memory and time required by the tree placement (pplacer) step. This obviously isn't a full test of GTDB-Tk, but it at least demonstrates that it still runs.
Cheers, Donovan
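A minimal sketch of the single-genome run suggested above, assuming the genome is a FASTA file from a species already present in the GTDB reference database. The helper names stage_single_genome and classify_wf_command are hypothetical, and the default extension is an assumption. Depending on the GTDB-Tk release, classify_wf may need further options controlling the ANI screen, so check gtdbtk classify_wf --help for the installed version before relying on this command line.

```python
import shutil
from pathlib import Path


def stage_single_genome(genome_fasta, work_dir):
    """Copy one genome FASTA into its own input directory and return it.

    Hypothetical helper: the workflow then sees exactly one genome.
    """
    genome_dir = Path(work_dir) / "genomes"
    genome_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(genome_fasta, genome_dir / Path(genome_fasta).name)
    return genome_dir


def classify_wf_command(genome_dir, out_dir, extension="fna", cpus=1):
    """Build the classify_wf command line as an argv list.

    Only --genome_dir, --out_dir, --extension and --cpus are set here;
    version-specific ANI screen options are deliberately left out.
    """
    return [
        "gtdbtk", "classify_wf",
        "--genome_dir", str(genome_dir),
        "--out_dir", str(out_dir),
        "--extension", extension,
        "--cpus", str(cpus),
    ]
```

The resulting list can be passed to subprocess.run(argv, check=True) on a machine where GTDB-Tk and its reference data are installed.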
For the Galaxy tool wrapping the classify workflow, it would be great to have a small reference data set for running a test in CI.
Is there a way to do this?