Open nick-youngblut opened 1 year ago
Currently, CheckM2 chunks both the DIAMOND annotation (500 genomes per chunk) and the feature vector generation & ML prediction (250 genomes per chunk) to avoid a large memory footprint. If you annotate in parallel (e.g. 10 batches of 5 threads in parallel, as opposed to 5 sequential batches of 50 threads), the memory requirements are probably going to be a fair bit higher.
As stated in diamond.py:
For large numbers of genomes (e.g., 10k or 100k MAGs), it would be best to annotate genomes in batches, with each batch annotated in a separate job. Then, the merged annotations can be provided as input to `checkm2 predict`. This should scale better than a single DIAMOND job for all genes in all genomes.

All that would likely be necessary to implement this is to allow gene annotation files as input (similar to `--genes` in `checkm2 predict`) and skip the gene calling & annotation steps.
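
To make the proposed workflow concrete, here is a rough sketch of the two pieces that would live outside CheckM2: splitting the genome list into batches (one per job) and merging the per-batch annotation tables afterwards. The function names `make_batches` and `merge_annotations` are hypothetical, and the merge step assumes each batch produces a headerless tab-separated file that can simply be concatenated. Nothing here calls CheckM2 itself, since the option to accept pre-computed annotations does not exist yet.

```python
from pathlib import Path
from typing import Iterator, List, Sequence


def make_batches(genomes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `genomes`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(genomes), batch_size):
        yield list(genomes[start:start + batch_size])


def merge_annotations(parts: Sequence[Path], merged: Path) -> int:
    """Concatenate per-batch annotation tables into one file.

    Assumption: every part is a headerless, tab-separated text file, so
    plain concatenation is enough. Returns the number of rows written.
    """
    rows = 0
    with merged.open("w") as out:
        for part in parts:
            with part.open() as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines between records
                    # Make sure a part without a trailing newline does not
                    # glue its last row onto the next part's first row.
                    out.write(line if line.endswith("\n") else line + "\n")
                    rows += 1
    return rows
```

Each batch from `make_batches` would be submitted as its own annotation job; once all jobs finish, `merge_annotations` produces the single file that the proposed input option would consume.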