chklovski / CheckM2

Assessing the quality of metagenome-derived genome bins using machine learning
GNU General Public License v3.0

checkm2 predict: diamond annotations as input #40

Open nick-youngblut opened 1 year ago

nick-youngblut commented 1 year ago

As stated in diamond.py:

Diamond only accepts single inputs, so we concat protein files and chunk them as input using tempfile

For large numbers of genomes (e.g., 10k or 100k MAGs), it would be best to annotate genomes in batches, with each batch annotated in a separate job. Then, the merged annotations can be provided as input to checkm2 predict. This should scale better than running a single DIAMOND job for all genes in all genomes.

All that would likely be necessary to implement this is to allow for gene annotation files as input (similar to --genes in checkm2 predict) and skip the gene calling & annotation steps.
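
To make the proposal concrete, here is a rough sketch of the batching and merging steps around the per-batch DIAMOND jobs. It is not part of CheckM2: `batched` and `merge_annotations` are hypothetical helper names, and the merge assumes each job writes headerless tab-separated output, so the files can simply be concatenated.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def batched(paths: Iterable[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive batches of at most `batch_size` genome files."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def merge_annotations(parts: Iterable[Path], merged: Path) -> int:
    """Concatenate per-batch annotation tables into one file.

    Assumes headerless tabular files (hypothetical layout); returns the
    number of lines written.
    """
    n_lines = 0
    with merged.open("w") as out:
        for part in parts:
            with part.open() as handle:
                for line in handle:
                    # Guard against a missing trailing newline in a part file.
                    out.write(line if line.endswith("\n") else line + "\n")
                    n_lines += 1
    return n_lines
```

Each batch from `batched(...)` would go to its own annotation job, and `merge_annotations(...)` would produce the single file that the proposed input option would read.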

chklovski commented 1 year ago

Currently CheckM2 chunks both the DIAMOND annotation (500 genomes per chunk) and the feature vector generation and ML prediction (250 genomes per chunk) to avoid a large memory footprint. If you annotate in parallel (e.g., 10 batches of 5 threads each running at once, as opposed to 5 sequential batches of 50 threads each), the memory requirements are probably going to be a fair bit higher.
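
As a back-of-the-envelope illustration of those chunk sizes, the snippet below counts how many sequential chunks a run would go through. The constants come from the numbers quoted above; the function name `chunk_counts` is made up for this example and is not a CheckM2 API.

```python
import math

DIAMOND_CHUNK = 500   # genomes per DIAMOND annotation chunk (quoted above)
PREDICT_CHUNK = 250   # genomes per feature/prediction chunk (quoted above)


def chunk_counts(n_genomes: int) -> tuple:
    """Return (DIAMOND chunks, prediction chunks) for a given genome count."""
    return (
        math.ceil(n_genomes / DIAMOND_CHUNK),
        math.ceil(n_genomes / PREDICT_CHUNK),
    )


print(chunk_counts(10_000))   # (20, 40)
print(chunk_counts(100_000))  # (200, 400)
```

Because these chunks run one after another, only one chunk's worth of data is held in memory at a time. Running several batches at once multiplies that by the number of concurrent jobs.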