Open ctb opened 2 years ago
(that's 2 billion k-mers. that's a lot. 🤔 )
For NCBI taxonomy, with GTDB rs207 genomes -
20744791 of 22792206 hashvals (91.0%) are perfectly informative at species level!
779234 of 2047415 hashvals (38.1%) are perfectly informative at genus level!
245718 of 1268181 hashvals (19.4%) are perfectly informative at family level!
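The remaining 1,022,463 hashvals (about 4.5% of total hashvals) are then taxonomically confused under the NCBI taxonomy. Since these are the same GTDB rs207 genomes, any difference from the GTDB numbers is solely because of taxonomy.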
Decided to take a look at individual genomes - which is much faster/easier -
15378449 of 22792206 hashvals (67.5%) are perfectly informative at genome level!
I think the 209k (GTDB) vs 1M (NCBI) taxonomy-confusing hashvals may be a useful quantitative measure for illustrating the relative "quality" of the two taxonomies. We wouldn't expect zero confusing hashvals, just because of chance and biology, but that 5-fold difference feels like it's telling us something.
I wonder what the value is, if you randomise the taxonomic assignments?
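One way to set up that null comparison, as a rough sketch only: shuffle the taxon labels across genomes (keeping the number of genomes per taxon fixed) and recount the hashvals that span more than one taxon. The dictionaries `hash_to_genomes` and `genome_to_taxon`, and both function names, are hypothetical stand-ins for whatever the notebook actually uses, not sourmash structures.

```python
import random


def shuffle_assignments(genome_to_taxon, seed=0):
    """Return a copy of the mapping with taxon labels permuted across genomes."""
    rng = random.Random(seed)  # fixed seed so the permutation is reproducible
    genomes = list(genome_to_taxon)
    taxa = [genome_to_taxon[g] for g in genomes]
    rng.shuffle(taxa)
    return dict(zip(genomes, taxa))


def count_confused(hash_to_genomes, genome_to_taxon):
    """Count hashvals whose genomes carry more than one taxon label."""
    return sum(
        1
        for genomes in hash_to_genomes.values()
        if len({genome_to_taxon[g] for g in genomes}) > 1
    )


# Toy data, invented for illustration.
genome_to_taxon = {"g1": "s__A", "g2": "s__A", "g3": "s__B", "g4": "s__B"}
hash_to_genomes = {101: ["g1", "g2"], 102: ["g3", "g4"], 103: ["g1", "g3"]}

observed = count_confused(hash_to_genomes, genome_to_taxon)
shuffled = shuffle_assignments(genome_to_taxon, seed=1)
randomised = count_confused(hash_to_genomes, shuffled)
print(observed, randomised)
```

Repeating the shuffle with many seeds would give a distribution to compare the observed count against.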
I was seeking a similar kind of "one number measure" to unicity distance, but for taxonomic incoherency, and I decided to try out k-mer Shannon entropy - link.
The basic idea would be: if a k-mer has a Shannon entropy of 0 at species level, that k-mer is perfectly informative for a species. (You can calculate Shannon entropy at any taxonomic rank.)
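A minimal sketch of that calculation, assuming the entropy is taken over the taxon labels of the genomes that contain a given k-mer (the notebook may define the distribution differently). The function name and input format are hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(taxa):
    """Shannon entropy, in bits, of the taxon labels attached to one k-mer.

    `taxa` holds one label per genome containing the k-mer, all at the
    same rank (hypothetical input format, not a sourmash structure).
    """
    counts = Counter(taxa)
    total = sum(counts.values())
    # H = sum over taxa of p * log2(1/p); a single taxon gives p = 1, so H = 0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Toy labels, invented for illustration.
only_one_species = ["s__A", "s__A", "s__A"]
split_two_species = ["s__A", "s__B"]

print(kmer_entropy(only_one_species))
print(kmer_entropy(split_two_species))
```

The first k-mer is found in a single species, so its entropy is 0.0; the second is split evenly between two species and has an entropy of 1.0 bit. Passing genus or family labels instead gives the entropy at that rank.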
We could also calculate Shannon entropy for k-mers at a genomic level, but I suspect that unicity distance #1 is more informative for things we care about there - unicity distance measures how many k-mers we need in order to distinguish this genome from all others. Shannon entropy and unicity distance are related in at least one particularly obvious way: if a k-mer has a Shannon entropy of 0 for a genome, then that genome has a unicity distance of 0 (because if you see that k-mer, you know you've got that genome).
as a side note, the thing that made this scalable was the `LCA_SqliteDatabase`, which let me build an LCA database for GTDB rs207!
some preliminary results
notebook - https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-tax-incoherency-kmer-entropy.ipynb
for GTDB rs207, k=31, scaled=10,000 -
21150287 of 22792206 hashvals (92.8%) are perfectly informative at species level!
of the remaining 1641919 hashvals, 1262281 (76.9%) are perfectly informative at genus level!
going on to family - 170249 of 379638 hashvals (44.8%) are perfectly informative at family level!
if this is correct, it would then mean that the remaining 209,389 hashvals (about 0.9% of total hashvals) are responsible for all the taxonomic confusion in GTDB for single-k-mer approaches.
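The species, genus, family cascade behind these counts can be sketched roughly as follows. This assumes each hashval maps to the lineages of the genomes containing it, and that a hashval is "perfectly informative" at a rank when all of those lineages agree at that rank. `hash_to_lineages` and `informative_cascade` are hypothetical names, not the notebook's code or a sourmash API.

```python
def informative_cascade(hash_to_lineages, ranks=("species", "genus", "family")):
    """Peel off hashvals that are unambiguous at each rank, most specific first.

    Returns a list of (rank, n_informative, n_considered) tuples and the
    hashvals still ambiguous after the last rank.
    """
    remaining = list(hash_to_lineages)
    report = []
    for rank in ranks:
        informative = {
            h
            for h in remaining
            if len({lineage[rank] for lineage in hash_to_lineages[h]}) == 1
        }
        report.append((rank, len(informative), len(remaining)))
        # Only the hashvals still ambiguous move on to the next, broader rank.
        remaining = [h for h in remaining if h not in informative]
    return report, remaining


def lineage(species, genus, family):
    """Build a toy lineage record (illustrative helper)."""
    return {"species": species, "genus": genus, "family": family}


# Toy data, invented for illustration.
hash_to_lineages = {
    1: [lineage("s1", "g1", "f1")],
    2: [lineage("s1", "g1", "f1"), lineage("s2", "g1", "f1")],
    3: [lineage("s1", "g1", "f1"), lineage("s3", "g2", "f1")],
    4: [lineage("s1", "g1", "f1"), lineage("s4", "g3", "f2")],
}

report, confused = informative_cascade(hash_to_lineages)
for rank, n_ok, n_total in report:
    print(f"{n_ok} of {n_total} hashvals ({n_ok / n_total:.1%}) "
          f"are perfectly informative at {rank} level")
print(f"{len(confused)} hashvals remain confused")
```

Each rank's denominator is whatever was left over from the previous rank, which matches how the percentages above are reported.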