Open valentynbez opened 2 months ago
Great idea!
This would require a lot of benchmarks to evaluate:
Adding this would also involve rewriting a good amount of code, so it's not something that I can implement quickly. But I really like the idea and will evaluate it for future releases.
As mmseqs2 is already part of the pipeline, it would be nice to have an option to cluster the query database before aligning it against the marker database. Phage genes are redundant, and even deduplication at 100% identity might shorten the computation time. Afterwards, the results can be mapped back from cluster representatives to the original sequences for downstream classification. This is not a useful feature for small datasets, but for larger ones it can reduce computation time and RAM usage significantly.
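
To illustrate the mapping-back step, here is a minimal sketch, not the pipeline's actual code. It assumes the clustering step yields a two-column TSV of (representative, member) pairs, which is how I understand the MMseqs2 cluster TSV output to be laid out, and that alignment hits are tab-separated rows whose first column is the query ID. The function names `load_clusters` and `expand_hits` and the sample IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_clusters(handle):
    """Read a two-column TSV (representative, member) into rep -> [members]."""
    members = defaultdict(list)
    for rep, member in csv.reader(handle, delimiter="\t"):
        members[rep].append(member)
    return members


def expand_hits(hits_handle, members):
    """Yield one hit row per cluster member, copying the representative's hit."""
    for row in csv.reader(hits_handle, delimiter="\t"):
        rep, rest = row[0], row[1:]
        # A representative missing from the cluster table maps to itself.
        for member in members.get(rep, [rep]):
            yield [member, *rest]


# Hypothetical in-memory data standing in for the real files.
clusters = io.StringIO("geneA\tgeneA\ngeneA\tgeneB\ngeneC\tgeneC\n")
hits = io.StringIO("geneA\tmarker1\t98.5\ngeneC\tmarker2\t91.0\n")
expanded = list(expand_hits(hits, load_clusters(clusters)))
```

Here `geneB` inherits the hit of its representative `geneA`, so downstream classification still sees every original sequence. With clustering below 100% identity, the copied alignment statistics would only be approximate for non-representative members, which is one of the things the benchmarks would need to cover.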