apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

[feature request] query database clustering #95

Open valentynbez opened 2 months ago

valentynbez commented 2 months ago

Since MMseqs2 is already part of the pipeline, it would be nice to have an option to cluster the query database before aligning it against the marker database. Phage genes are highly redundant, so even deduplication at 100% identity might shorten the computation time. Afterwards, the results can be mapped back from the cluster representatives to the original sequences for downstream classification. This would not be useful for small datasets, but for larger ones it could significantly reduce computation time and RAM usage.
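To make the idea concrete, here is a minimal sketch of the deduplicate, search, and map-back steps in plain Python. It is not geNomad code and does not call MMseqs2: exact string matching stands in for clustering at 100% identity, and the function names, gene IDs, and the `marker_A` hit are all hypothetical.

```python
from collections import defaultdict


def deduplicate(seqs):
    """Group identical sequences under one representative.

    `seqs` maps sequence ID -> sequence string. The first ID seen for a
    given sequence becomes the representative of its cluster.
    (Exact matching is an assumption standing in for real clustering.)
    """
    rep_by_seq = {}
    members = defaultdict(list)
    for seq_id, seq in seqs.items():
        rep = rep_by_seq.setdefault(seq, seq_id)
        members[rep].append(seq_id)
    return dict(members)


def expand_hits(rep_hits, members):
    """Copy each representative's hit to every member of its cluster."""
    expanded = {}
    for rep, hit in rep_hits.items():
        for seq_id in members[rep]:
            expanded[seq_id] = hit
    return expanded


# Hypothetical query proteins: g1 and g2 are identical.
proteins = {"g1": "MKVLA", "g2": "MKVLA", "g3": "MAAQT"}
clusters = deduplicate(proteins)

# Hypothetical search result, computed for representatives only.
rep_hits = {"g1": "marker_A"}
all_hits = expand_hits(rep_hits, clusters)
```

Only the representatives (`g1` and `g3` here) would be aligned against the marker database, and `expand_hits` then restores one result per original sequence for downstream classification.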

apcamargo commented 1 month ago

Great idea!

This would require a lot of benchmarks to evaluate:

Adding this would also involve rewriting a good amount of code, so it's not something that I can implement quickly. But I really like the idea and will evaluate it for future releases.