apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

How should geNomad be used appropriately when there is a huge number of assemblies to process? #59

Open actledge opened 6 months ago

actledge commented 6 months ago

Hi,

I have a question similar to issue #26 (https://github.com/apcamargo/genomad/issues/26#issue-1796936020), but I have a large number of assemblies to screen for prophages with geNomad (several hundred GB, or even a few TB, in total). I suspect our machines will not be able to handle running all genomes in a single file. At the same time, because of the large amount of data, I may need the score calibration feature to reduce the number of false-positive sequences, and calibration needs enough sequences per run, so running each assembly separately seems inappropriate (most of them have fewer than 1,000 contigs).

So I would like to ask how to handle this situation: a large batch of data that may not be suitable for running all at once, but that also needs score calibration. Maybe I should split my data into multiple parts, run each part separately, and then merge the results? If so, another question is how many parts I should split it into. Do you know the approximate ratio between geNomad's memory usage and the input data size?

By the way, is the --splits parameter effective in this situation? That is, is the memory usage of the database related only to the database's own size, and unrelated to the size of the input data?

Thanks!

apcamargo commented 6 months ago

The --splits parameter splits the target database, not the query, so I expect that memory consumption will not be affected by --splits as your input gets bigger.
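For reference, here is a minimal sketch of how --splits is typically passed to the end-to-end command; input.fna, genomad_output, and genomad_db are placeholder paths:

```bash
# Split the marker-database search into 8 pieces to lower peak memory.
# All three paths below are placeholders.
genomad end-to-end --cleanup --splits 8 input.fna genomad_output genomad_db
```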

Maybe I should split my data into multiple parts, run each part separately, and then merge the results? If so, another question is how many parts I should split it into. Do you know the approximate ratio between geNomad's memory usage and the input data size?

That's precisely what I do. I concatenate samples and then use seqkit split2 to split them into chunks of ~75k sequences each. But the reason I do this is to run each chunk on a separate node to accelerate the classification process. I haven't evaluated the ratio between memory usage and input data size; geNomad may well be able to handle much larger inputs.
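A rough sketch of that workflow, assuming seqkit is installed; the loop stands in for submitting each chunk as a separate job, and --enable-score-calibration is included because calibration is the reason for batching (check the flag against your geNomad version):

```bash
# Concatenate all assemblies into a single file (placeholder paths).
cat assemblies/*.fna > concatenated.fna

# Split into chunks of ~75k sequences each.
seqkit split2 --by-size 75000 --out-dir chunks concatenated.fna

# Classify each chunk; in practice each iteration would be submitted
# as its own job on a separate node.
for chunk in chunks/*.fna; do
    name=$(basename "$chunk" .fna)
    genomad end-to-end --cleanup --enable-score-calibration \
        "$chunk" "out_${name}" genomad_db
done
```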

By the way, is the --splits parameter effective in this situation? That is, is the memory usage of the database related only to the database's own size, and unrelated to the size of the input data?

Good question. As I mentioned, this parameter splits the target database (the marker database), not your input. It works well for cases where the hardware can't handle the full target database. I've never benchmarked how memory consumption increases as inputs get bigger, but I've classified some pretty large files without an issue.
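Since the question above also asks about merging, here is a hedged sketch of combining per-chunk virus summaries after runs like the ones sketched earlier; it assumes the default output layout (out_<name>/<name>_summary/<name>_virus_summary.tsv), which you should verify against your geNomad version:

```bash
# Merge the per-chunk virus summary tables, keeping the header once.
# Output paths assume geNomad's default layout; verify on your install.
first=1
for tsv in out_*/*_summary/*_virus_summary.tsv; do
    if [ "$first" -eq 1 ]; then
        cat "$tsv"
        first=0
    else
        tail -n +2 "$tsv"
    fi
done > merged_virus_summary.tsv
```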