BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/

Efficient taxonomic annotation #69

Closed SilasK closed 2 years ago

SilasK commented 2 years ago

Hey, I've seen that you have already started to allow running mmseqs outside of SemiBin.

Do you think there could be an efficient way to annotate a large set of samples with taxonomy by first creating a gene catalog (e.g. with linclust), annotating the gene catalog once, and then aggregating the taxonomy for each contig?
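
To make the aggregation step concrete, here is a minimal sketch of what I have in mind. It is not part of SemiBin or MMseqs2: the function names, the Prodigal-style gene IDs (`<contig>_<index>`), and the majority-vote rule are all assumptions for illustration. It takes a gene-to-representative mapping (as a clustering step would produce) and a lineage per representative, and keeps the deepest lineage prefix that more than `min_support` of a contig's annotated genes agree on.

```python
from collections import Counter, defaultdict


def contig_of(gene_id):
    # Assumption: Prodigal-style gene IDs of the form "<contig>_<index>".
    return gene_id.rsplit("_", 1)[0]


def aggregate_taxonomy(gene_to_rep, rep_to_lineage, min_support=0.5):
    """Hypothetical helper: per-contig consensus lineage by majority vote.

    gene_to_rep:    gene ID -> cluster representative ID
    rep_to_lineage: representative ID -> list of taxa, root first
    """
    # Collect the lineage of every annotated gene, grouped by contig.
    votes = defaultdict(list)
    for gene, rep in gene_to_rep.items():
        lineage = rep_to_lineage.get(rep)
        if lineage:
            votes[contig_of(gene)].append(tuple(lineage))

    result = {}
    for contig, lineages in votes.items():
        consensus = ()
        depth = 0
        while True:
            # Count lineage prefixes one rank deeper than the current consensus.
            counts = Counter(l[: depth + 1] for l in lineages if len(l) > depth)
            if not counts:
                break
            prefix, n = counts.most_common(1)[0]
            # Stop when support is too low or the winner leaves the consensus path.
            if n / len(lineages) <= min_support or prefix[:depth] != consensus:
                break
            consensus = prefix
            depth += 1
        result[contig] = list(consensus)
    return result
```

Contigs without any annotated gene are simply absent from the result. Raising `min_support` makes the assignment more conservative, so contigs with conflicting genes stop at a higher rank.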

luispedro commented 2 years ago

I would not recommend doing that without at least a bit of benchmarking. If the clustering is at 100% amino acid identity, I think this becomes like reimplementing the mmseqs taxonomy module; otherwise, I think you have to be careful not to lose precision.