taxonomic entropy for k-mers

ctb commented 2 years ago

I was seeking a similar kind of "one number measure" to unicity distance, but for taxonomic incoherency, and I decided to try out k-mer Shannon entropy - link.

The basic idea would be: if a k-mer has a Shannon entropy of 0 at species level, that k-mer is perfectly informative for a species. (You can calculate Shannon entropy at any taxonomic rank.)
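To make the idea concrete, here is a minimal sketch of per-k-mer entropy at a chosen rank. It is not the notebook's code: the `RANKS` tuple, the lineage-as-tuple representation, the `entropy_at_rank` name, and the choice to weight each taxon by the number of genomes containing the k-mer are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical rank order; lineages are tuples of names in this order.
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")

def entropy_at_rank(lineages, rank):
    """Shannon entropy (bits) over the taxa at `rank` for one k-mer.

    `lineages` holds one lineage tuple per genome containing the k-mer
    (assumed weighting: each genome counts once).
    """
    depth = RANKS.index(rank) + 1
    # Truncate to the rank so equal names under different parents stay distinct.
    counts = Counter(lineage[:depth] for lineage in lineages)
    total = sum(counts.values())
    return sum((n / total) * log2(total / n) for n in counts.values())

# Toy lineages (placeholder names, not real taxa).
a = ("d", "p", "c", "o", "f1", "g1", "s1")
b = ("d", "p", "c", "o", "f1", "g1", "s2")

same_species = entropy_at_rank([a, a], "species")   # one species only
two_species = entropy_at_rank([a, b], "species")    # split across two species
same_genus = entropy_at_rank([a, b], "genus")       # both species share a genus
```

A k-mer seen in two species of the same genus has nonzero entropy at species level but zero entropy at genus level, which is the sense of "perfectly informative" at a given rank.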

We could also calculate Shannon entropy for k-mers at a genomic level, but I suspect that unicity distance #1 is more informative for things we care about there - unicity distance measures how many k-mers we need in order to distinguish this genome from all others. Shannon entropy and unicity distance are related in at least one particularly obvious way: if a k-mer has a Shannon entropy of 0 for a genome, then that genome has a unicity distance of 0 (because if you see that k-mer, you know you've got that genome).

as a side note, the thing that made this scalable was the LCA_SqliteDatabase which let me build an LCA database for GTDB rs207!

some preliminary results

notebook - https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-tax-incoherency-kmer-entropy.ipynb

for GTDB rs207, k=31, scaled=10,000 -

21150287 of 22792206 hashvals (92.8%) are perfectly informative at species level!

of the remaining 1641919 hashvals, 1262281 of 1641919 hashvals (76.9%) are perfectly informative at genus level!

going on to family - 170249 of 379638 hashvals (44.8%) are perfectly informative at family level!

if this is correct, it would then mean that the remaining 209,389 hashvals (about 0.9% of total hashvals) are responsible for all the taxonomic confusion in GTDB for single-k-mer approaches.
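The species, then genus, then family breakdown above can be sketched as a cascade in which each rank only sees the hashvals left unresolved by the previous one. This is an illustrative sketch with hypothetical names (`is_informative`, `cascade`, a plain dict from hashval to lineage tuples), not the notebook's implementation. It uses the fact that entropy is 0 exactly when all lineages agree at that rank.

```python
# Hypothetical rank order; lineages are tuples of names in this order.
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")

def is_informative(lineages, rank):
    """True when every genome containing the hashval shares one taxon at `rank`."""
    depth = RANKS.index(rank) + 1
    return len({lineage[:depth] for lineage in lineages}) == 1

def cascade(hash_to_lineages, ranks=("species", "genus", "family")):
    """Count hashvals resolved at each rank, passing the rest to the next rank."""
    remaining = list(hash_to_lineages)
    resolved = {}
    for rank in ranks:
        unresolved = [h for h in remaining
                      if not is_informative(hash_to_lineages[h], rank)]
        resolved[rank] = len(remaining) - len(unresolved)
        remaining = unresolved
    return resolved, remaining

# Toy lineages (placeholder names, not real taxa).
a = ("d", "p", "c", "o", "f1", "g1", "s1")
b = ("d", "p", "c", "o", "f1", "g1", "s2")  # same genus as a
c = ("d", "p", "c", "o", "f1", "g2", "s3")  # same family as a
d = ("d", "p", "c", "o", "f2", "g3", "s4")  # different family

hash_to_lineages = {1: [a], 2: [a, a], 3: [a, b], 4: [a, c], 5: [a, d]}
resolved, leftover = cascade(hash_to_lineages)
```

In this toy input two hashvals resolve at species level, one more at genus, one more at family, and one is left over, mirroring how the 209,389 remainder is derived from the full set.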

ctb commented 2 years ago

(at scaled=10,000, those 209,389 hashvals correspond to roughly 2 billion k-mers. that's a lot. 🤔 )

ctb commented 2 years ago

For NCBI taxonomy, with GTDB rs207 genomes -

20744791 of 22792206 hashvals (91.0%) are perfectly informative at species level!

779234 of 2047415 hashvals (38.1%) are perfectly informative at genus level!

245718 of 1268181 hashvals (19.4%) are perfectly informative at family level!

The remaining 1,022,463 hashvals (about 4.5% of total hashvals) are then taxonomically confused in NCBI, solely because of taxonomy.

ctb commented 2 years ago

Decided to take a look at individual genomes - which is much faster/easier -

15378449 of 22792206 hashvals (67.5%) are perfectly informative at genome level!

widdowquinn commented 2 years ago

I think the 209k vs 1mn taxonomy-confusing hashvals may be a useful quantitative measure for illustrating the relative "quality" of the two taxonomies. We wouldn't expect no confusing hashvals, just because of chance and biology, but that 5-fold difference feels like it's telling us something.

I wonder what the value is, if you randomise the taxonomic assignments?
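One way to approach that question would be a permutation baseline: shuffle the lineages across genomes, recount the confusing hashvals, and compare with the observed count. The sketch below is only an illustration on toy data; the function names, the `(family, genus, species)` tuple layout, and the genome and hashval identifiers are all hypothetical.

```python
import random

def count_confusing(genome_to_lineage, hash_to_genomes, depth):
    """Count hashvals whose genomes span more than one taxon at lineage depth `depth`."""
    return sum(
        len({genome_to_lineage[g][:depth] for g in genomes}) > 1
        for genomes in hash_to_genomes.values()
    )

def shuffle_lineages(genome_to_lineage, rng):
    """Return a new mapping with the same lineages permuted across genomes."""
    genomes = list(genome_to_lineage)
    lineages = [genome_to_lineage[g] for g in genomes]
    rng.shuffle(lineages)
    return dict(zip(genomes, lineages))

# Toy data: lineages are (family, genus, species); depth=1 compares families.
genome_to_lineage = {
    "A": ("f1", "g1", "s1"),
    "B": ("f1", "g1", "s2"),
    "C": ("f2", "g2", "s3"),
    "D": ("f2", "g2", "s4"),
}
hash_to_genomes = {10: {"A", "B"}, 11: {"C", "D"}, 12: {"A"}, 13: {"B", "C"}}

observed = count_confusing(genome_to_lineage, hash_to_genomes, depth=1)

rng = random.Random(0)  # fixed seed so the baseline is reproducible
null_counts = [
    count_confusing(shuffle_lineages(genome_to_lineage, rng), hash_to_genomes, depth=1)
    for _ in range(100)
]
```

Comparing `observed` against the spread of `null_counts` would indicate how far a taxonomy sits from random assignment. On real data the shuffle would presumably need to run over all GTDB rs207 genomes for each taxonomy.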

