Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

Small (mock) reference data #572

Closed: bernt-matthias closed this issue 6 months ago

bernt-matthias commented 9 months ago

For the Galaxy tool for the classify workflow, it would be great to have a small set of reference data so that a test can run in CI.

Is there a possibility to do this?

donovan-h-parks commented 8 months ago

Hi Brent. GTDB-Tk has the check_install command to verify that all reference data and third-party dependencies are as expected. It also has the test command, which runs a set of included genomes through the classify workflow as an integration test. If you prefer to run CI differently, you can use the genomes processed by the test command, which are included in the GTDB-Tk reference data.

https://ecogenomics.github.io/GTDBTk/commands/check_install.html
https://ecogenomics.github.io/GTDBTk/commands/test.html
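
For anyone scripting the two commands mentioned above, here is a minimal Python sketch. It assumes `gtdbtk` is on the `PATH` and that the full reference data is already installed. The helper names `build_smoke_commands` and `run_all` are hypothetical and not part of GTDB-Tk, and the `--out_dir` and `--cpus` options should be checked against `gtdbtk test -h` for your installed version.

```python
import subprocess


def build_smoke_commands(out_dir, cpus=1):
    """Return the two GTDB-Tk commands as argument lists (hypothetical helper)."""
    return [
        # Verifies reference data and third-party dependencies.
        ["gtdbtk", "check_install"],
        # Runs the bundled genomes through the classify workflow.
        ["gtdbtk", "test", "--out_dir", str(out_dir), "--cpus", str(cpus)],
    ]


def run_all(commands):
    """Run each command in order; raises CalledProcessError on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_smoke_commands("gtdbtk_test_out", cpus=2))` would execute both steps. Note that both still need the full reference data, so this does not address the size constraint by itself.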

bernt-matthias commented 8 months ago

The processed genomes are not the problem. The problem is the reference data (i.e. GTDB). We are wondering whether a small database exists (or could be constructed) that we can use in tests.

Our problem is that we have hundreds of tools in one of the main Galaxy tool repos (https://github.com/galaxyproject/tools-iuc/), so we have to restrict ourselves to small test (and reference) data.

donovan-h-parks commented 8 months ago

Hi Brent,

Makes sense, but unfortunately we don't have such a set of reference data. What might work for you is to run a single genome that belongs to a species in the GTDB reference database. This will result in only the ANI prescreen part of GTDB-Tk running, and thus avoid the memory and time required by the tree placement (pplacer) step. This obviously isn't a full test of GTDB-Tk, but it at least demonstrates that it still runs.

Cheers, Donovan
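
The single-genome suggestion above could be scripted roughly as follows. This is only a sketch under stated assumptions: `stage_single_genome` and `build_classify_command` are hypothetical helper names, the genome is assumed to be a FASTA file with an `.fna` extension, and some GTDB-Tk releases require additional options that control the ANI screen, so check `gtdbtk classify_wf -h` for your version before relying on it.

```python
import shutil
import tempfile
from pathlib import Path


def stage_single_genome(genome_fasta, work_dir=None):
    """Copy one genome into a fresh directory so classify_wf sees exactly one input."""
    if work_dir is None:
        work_dir = tempfile.mkdtemp(prefix="gtdbtk_ci_")
    genome_dir = Path(work_dir) / "genomes"
    genome_dir.mkdir(parents=True, exist_ok=False)
    shutil.copy(genome_fasta, genome_dir)
    return genome_dir


def build_classify_command(genome_dir, out_dir, extension="fna", cpus=1):
    """Build the classify_wf argument list; pass it to subprocess.run to execute."""
    return [
        "gtdbtk", "classify_wf",
        "--genome_dir", str(genome_dir),
        "--out_dir", str(out_dir),
        "--extension", extension,
        "--cpus", str(cpus),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. The reference data still has to be present on the machine; the saving comes from skipping the pplacer step, not from a smaller database.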