Open taylorreiter opened 4 years ago
I don't see any inherent barrier - LCA methods always aggregate to at least one level higher than the lowest taxonomic rank present, but if we did a bunch of species-level classification on the full set of GTDB genomes, you'd be able to get at strain-level differences simply by looking at differences between members of the same species.
I think the bigger barrier here is choosing cutoffs to delineate between genus, species, and strain. It's not clear to me from my various readings that there's a natural cutoff anywhere above 97% similarity... so how do you decide that something is a new strain vs a new species?
Ah, that's a good point. Especially for clinically relevant strains, a pathogenic strain may be 99.99% similar to a non-pathogenic strain, given that a single SNP or a single HGT event could confer pathogenicity. These organisms would be referred to as separate strains in the medical literature, but could look identical using sourmash depending on the size/placement of the changes.
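To make the "could look identical" point concrete, here is a rough sketch in plain Python (not the sourmash API). It assumes a random 10 kb toy sequence, k=31, no reverse-complement handling, and a SHA-1 based `subsample` helper that only stands in for hash-based downsampling; all names and parameters are made up for illustration. A single SNP can change at most k k-mers, so exact Jaccard barely moves, and a downsampled sketch may not contain any of the changed k-mers at all.

```python
import hashlib
import random


def kmer_set(seq, k):
    """All distinct k-mers in seq (forward strand only, for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def subsample(kmers, scaled):
    """Keep roughly 1/scaled of the k-mers, chosen by hash value.

    Illustrative stand-in for hash-based downsampling, not sourmash's
    actual hash function.
    """
    cutoff = 2**64 // scaled
    return {
        km for km in kmers
        if int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big") < cutoff
    }


random.seed(0)
k = 31
genome = "".join(random.choice("ACGT") for _ in range(10_000))

# Introduce one SNP in the middle of the toy genome.
pos = 5_000
alt = "A" if genome[pos] != "A" else "C"
variant = genome[:pos] + alt + genome[pos + 1:]

exact_a, exact_b = kmer_set(genome, k), kmer_set(variant, k)
lost = len(exact_a - exact_b)          # k-mers destroyed by the SNP (at most k)
exact_similarity = jaccard(exact_a, exact_b)

sketch_a, sketch_b = subsample(exact_a, 100), subsample(exact_b, 100)
sketch_similarity = jaccard(sketch_a, sketch_b)

print(f"k-mers lost to one SNP: {lost}")
print(f"exact Jaccard: {exact_similarity:.4f}")
print(f"sketch sizes: {len(sketch_a)}, {len(sketch_b)}; sketch Jaccard: {sketch_similarity:.4f}")
```

On a real multi-megabase genome the relative effect of one SNP is far smaller than in this 10 kb toy, which is why the changes can disappear entirely at typical sketch sizes.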
soooo probably very tricky.
My thought here, though, is that we could say contig 1 and contig 2 share a lot of the same k-mers, and those k-mers are in general more abundant than most of the other k-mers in the genome bin, so the two are probably duplicate/redundant contigs of similar genomic regions in different strains.
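A minimal sketch of that heuristic, just to pin the idea down. This is not charcoal or sourmash code; the function name `redundant_contig_pairs`, the thresholds `min_shared` and `min_ratio`, and the toy data are all hypothetical. It counts k-mer abundance from reads, then flags contig pairs that share a large fraction of k-mers whose median abundance is elevated relative to the bin-wide median.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import median


def kmers(seq, k):
    """List of k-mers in seq (forward strand only, for simplicity)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def redundant_contig_pairs(contigs, reads, k=21, min_shared=0.5, min_ratio=1.5):
    """Flag contig pairs whose shared k-mers are unusually abundant.

    contigs: dict of name -> sequence; reads: iterable of sequences.
    min_shared and min_ratio are arbitrary illustrative thresholds.
    """
    abund = Counter(km for read in reads for km in kmers(read, k))
    contig_kmers = {name: set(kmers(seq, k)) for name, seq in contigs.items()}
    bin_kmers = set().union(*contig_kmers.values())
    bin_median = median(abund[km] for km in bin_kmers)

    flagged = []
    for a, b in combinations(sorted(contigs), 2):
        shared = contig_kmers[a] & contig_kmers[b]
        smaller = min(len(contig_kmers[a]), len(contig_kmers[b]))
        if not shared or smaller == 0:
            continue
        shared_fraction = len(shared) / smaller
        shared_median = median(abund[km] for km in shared)
        if (shared_fraction >= min_shared and bin_median > 0
                and shared_median / bin_median >= min_ratio):
            flagged.append((a, b, shared_fraction, shared_median / bin_median))
    return flagged


# Toy bin: c1 and c2 carry the same 200 bp region followed by different
# flanks, as if assembled from two strains; c3 is unrelated sequence.
random.seed(1)


def rand_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))


region = rand_seq(200)
contigs = {
    "c1": region + rand_seq(100),
    "c2": region + rand_seq(100),
    "c3": rand_seq(300),
}
# Stand-in for reads: each contig sequenced once, so the shared region
# ends up with twice the abundance of everything else.
reads = list(contigs.values())

pairs = redundant_contig_pairs(contigs, reads)
for a, b, frac, ratio in pairs:
    print(f"{a} vs {b}: shared fraction {frac:.2f}, abundance ratio {ratio:.2f}")
```

With real data the abundance signal would be much noisier (uneven coverage, repeats, conserved genes across species), so the thresholds here should be read as placeholders rather than recommendations.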
At some very distant future date, could we somehow use k-mer abundance to estimate strain heterogeneity? The GTDB LCA database has two layers of aggregation that obscure strain-level differences. First, GTDB guide genomes are representatives, so other closely related strains aren't captured in the database, and only one strain should be assigned to closely related strains of the same species. Second, LCA methods aggregate to species level as the lowest taxonomic rank (I think). These two things would obscure whether two strains were present in the same genome bin.
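For reference, the second layer of aggregation can be sketched as a longest-common-prefix over lineages. This is a simplified illustration, not sourmash's LCA implementation; the `lca` helper is hypothetical and the lineages are written by hand in GTDB-style notation purely as examples. Because the lineage stops at species, two strains of the same species collapse to the same assignment.

```python
def lca(lineages):
    """Lowest common ancestor of several lineages, as their shared prefix."""
    common = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return tuple(common)


# Illustrative GTDB-style lineages, written by hand for this example.
ecoli = (
    "d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria",
    "o__Enterobacterales", "f__Enterobacteriaceae",
    "g__Escherichia", "s__Escherichia coli",
)
efergusonii = ecoli[:6] + ("s__Escherichia fergusonii",)

# A k-mer shared by two strains of the same species: the lineage has no
# strain rank, so the LCA is the species and the strain identity is lost.
same_species = lca([ecoli, ecoli])

# A k-mer shared by two species in one genus aggregates up to the genus.
same_genus = lca([ecoli, efergusonii])

print(same_species[-1])
print(same_genus[-1])
```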
Currently, we've tried charcoal with a genus-level cutoff for contamination. With the "clean" contigs, could we use cross-contig k-mer abundance profiles to estimate strain heterogeneity, e.g. contigs that contain mostly the same genes/operons but originate from different genomes from a population of closely related strains?
Might not really be worth the time -- marker-gene based methods might get at this better, just a thought though.