chklovski / CheckM2

Assessing the quality of metagenome-derived genome bins using machine learning
GNU General Public License v3.0

running checkm2 on a large number of genomes #79

Open sherlyn99 opened 1 year ago

sherlyn99 commented 1 year ago

Hi, I have ~1 million isolate genomes, and I want to run checkm2 to assess their completeness and contamination. I came across #67, and I was wondering: what is a good way to pass a large number of genomes to checkm2 predict?

I am currently doing

checkm2 predict \
    -t 30 \
    -i $(cat filelist.txt) \
    -o <output_directory> \
    --database_path <database_path> \
    --remove_intermediates

1) Is there a limit to the number of files in filelist.txt?
2) Is there a better way to do this than passing in a list of file paths?

Thank you so much!

chklovski commented 1 year ago

Hi,

Sorry for the late reply. In principle, I don't think there is any limit on how many inputs you can pass to --input, though Python's argparse may be subject to system-wide limitations, as is whatever OS you're using (e.g. a specific Linux distro's limit on command-line length).
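For reference, the OS limit in question is the kernel's cap on the combined size of command-line arguments, which you can query with getconf; expanding a million paths via $(cat filelist.txt) is the most likely place to hit it. A rough sketch of batching around it (the chunk size is arbitrary, and <output_directory>/<database_path> are placeholders as in your command above):

# Kernel limit on the combined size of argv + environment (bytes):
getconf ARG_MAX

# Split the file list into chunks small enough to expand on one command line,
# then run CheckM2 once per chunk (assumes genome paths contain no spaces):
split -l 5000 filelist.txt chunk_
for chunk in chunk_*; do
    checkm2 predict -t 30 -i $(cat "$chunk") -o "out_${chunk}" \
        --database_path <database_path> --remove_intermediates
done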

Passing in a list of files in a txt seems fine. I have .tar archive input on the future-features list for CheckM2, but some sections of the code need to be rewritten to avoid tarbombs.

Please let me know if you encounter issues with the workflow; I've never run CheckM2 on that many genomes, so it would be good to know whether it can handle it.

sherlyn99 commented 12 months ago

Thank you so much for getting back to me! I am writing to provide an update:

I have been running a job array of ~500 jobs, each covering 2,750 genomes. However, I am frequently hitting out-of-memory errors: I currently give each job 200 GB of memory and 48 hours, yet some jobs still fail with an out-of-memory error (exit status 125). Do you have any suggestions for how much memory I should request per job so that the whole array runs smoothly? Thank you!
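For context, each task in my array looks roughly like the sketch below (a SLURM-style sketch; the chunk files and <database_path> are placeholders):

#!/bin/bash
#SBATCH --array=0-499            # ~500 tasks, one chunk of genomes each
#SBATCH --cpus-per-task=30
#SBATCH --mem=200G
#SBATCH --time=48:00:00

# chunk_0000.txt .. chunk_0499.txt each list ~2750 genome paths, e.g. from:
#   split -d -a 4 -l 2750 --additional-suffix=.txt filelist.txt chunk_
CHUNK=$(printf 'chunk_%04d.txt' "$SLURM_ARRAY_TASK_ID")

checkm2 predict \
    -t "$SLURM_CPUS_PER_TASK" \
    -i $(cat "$CHUNK") \
    -o "out_${SLURM_ARRAY_TASK_ID}" \
    --database_path <database_path> \
    --remove_intermediates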