unicity distances for genomes

ctb commented 2 years ago

Q: How many sourmash hashes does it take to classify a genome uniquely?

If the answer is 1, then there is a single hash that identifies this specific genome! :tada:

If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.

This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!

Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.

GCF_000814905.1 1 434 0.2%
GCA_007116955.1 1 247 0.4%
GCF_017948435.1 1 608 0.2%
GCA_017995835.1 1 317 0.3%
GCF_001981135.1 5 723 0.7%
GCF_900111215.1 1 223 0.4%
GCF_001084125.1 153 204 75.0%
GCF_002224665.1 1 506 0.2%
GCA_900758335.1 1 363 0.3%
GCF_008326505.1 1 202 0.5%
GCF_001204095.1 1 266 0.4%
GCA_018665005.1 1 231 0.4%
GCA_900478005.2 108 405 26.7%
GCF_003812625.1 28 525 5.3%
GCF_002088895.1 3 314 1.0%
GCA_002397945.1 1 122 0.8%
GCA_004116405.1 2 517 0.4%
GCF_003117635.1 2 493 0.4%
GCA_902789845.1 1 210 0.5%
GCF_016628485.1 1 255 0.4%

taylorreiter commented 2 years ago

THIS IS SO COOL.

Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.

Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?

ctb commented 2 years ago

THIS IS SO COOL.

:) it does seem to nicely wrap a bunch of concerns up into a single number!

Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.

yeah, exactly. working on it - I'm trying to figure out what a similar number might look like for a taxonomy.

Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?

unicity distance is formally defined as the minimum, so there is no shorter list of k-mers (...see the connection to min-set-cov? 😆 )

there may be multiple collections of k-mers that can be used, tho. maybe a way to phrase that question would be "what number of the hashes in the genome can uniquely identify the genome?" but this only makes sense in the case where the unicity distance is 1 anyway.

right now the easiest/simplest interpretations are when unicity is 1 or infinite - 1 breaks LCA-style approaches for genome identification, infinite breaks gather-style approaches for genome identification.

ctb commented 2 years ago

p.s. currently calculating estimated unicity for all genomes in GTDB rs207.

widdowquinn commented 2 years ago

That's fascinating!

I see that GCF_001084125.1 is Neisseria gonorrhoeae, which is both an important and interesting case study for taxonomic distinction, and Neisseria are often considered challenging because they're highly recombinogenic, very closely-related to other Neisseria species, and tend to co-inhabit the same body space (the back of your nose, IIRC). I've used these when trying to convince people of the value of ANI in the past… some slides from an old talk…

There may be something more recent than this paper but it's a good place to start on the why of them being a challenging group.

ctb commented 2 years ago

very cool - I suspect that this unicity number is going to be a good, simple way to find all the challenging genomes 😆

shokrof commented 2 years ago

Thanks for sharing this number for me! I am curious to see if there is a correlation between the unicity number and the genotyping accuracy!

ctb commented 2 years ago

First pass x GTDB rs207 - 15.3% of genomes have unicity distance of 1; 29.2% have an infinite unicity distance (k=31, scaled=1000).

Note that the 29.2% is not an estimate, but the 15.3% is a lower bound - there may be more genomes that have a unicity distance of 1.

ctb commented 2 years ago

notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb

top 20 species with infinite unicity (k=31, scaled=1000)

	species
s__Staphylococcus aureus	8367
s__Salmonella enterica	7602
s__Escherichia coli	6324
s__Streptococcus pneumoniae	5513
s__Mycobacterium tuberculosis	5485
s__Klebsiella pneumoniae	4428
s__Acinetobacter baumannii	2519
s__Pseudomonas aeruginosa	1873
s__Streptococcus pyogenes	1399
s__Listeria monocytogenes	1207
s__Mycobacterium abscessus	1139
s__Listeria monocytogenes_B	1097
s__Clostridioides difficile	958
s__Burkholderia mallei	937
s__Neisseria meningitidis	893
s__Streptococcus suis	890
s__Wolbachia pipientis	869
s__Pseudomonas_E viridiflava	855
s__Vibrio cholerae	854
s__Enterococcus_B faecium	851

top 20 genera with infinite unicity (k=31, scaled=1000)

	genus
g__Streptococcus	9682
g__Staphylococcus	9476
g__Salmonella	7700
g__Mycobacterium	6875
g__Escherichia	6383
g__Klebsiella	5032
g__Acinetobacter	2772
g__Listeria	2456
g__Pseudomonas	1900
g__Pseudomonas_E	1875
g__Vibrio	1794
g__Burkholderia	1654
g__Neisseria	1397
g__Campylobacter_D	1389
g__Enterococcus_B	1178
g__Clostridioides	960
g__Bordetella	938
g__Francisella	906
g__Wolbachia	905
g__Enterococcus	787

taylorreiter commented 2 years ago

oooh very cool. I think this shows either species (or genera) with a ton of genomes in GenBank or GTDB (like e. coli, p. arg, etc.) or genomes with high swappiness (Neisseria, Wolbachia)

bavinatzer commented 2 years ago

Q: How many sourmash hashes does it take to classify a genome uniquely?

If the answer is 1, then there is a single hash that identifies this specific genome! 🎉

If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.

This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!

Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.
GCF_000814905.1 1 434 0.2%
GCA_007116955.1 1 247 0.4%
GCF_017948435.1 1 608 0.2%
GCA_017995835.1 1 317 0.3%
GCF_001981135.1 5 723 0.7%
GCF_900111215.1 1 223 0.4%
GCF_001084125.1 153 204 75.0%
GCF_002224665.1 1 506 0.2%
GCA_900758335.1 1 363 0.3%
GCF_008326505.1 1 202 0.5%
GCF_001204095.1 1 266 0.4%
GCA_018665005.1 1 231 0.4%
GCA_900478005.2 108 405 26.7%
GCF_003812625.1 28 525 5.3%
GCF_002088895.1 3 314 1.0%
GCA_002397945.1 1 122 0.8%
GCA_004116405.1 2 517 0.4%
GCF_003117635.1 2 493 0.4%
GCA_902789845.1 1 210 0.5%
GCF_016628485.1 1 255 0.4%

One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.

bavinatzer commented 2 years ago

notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb

top 20 species with infinite unicity (k=31, scaled=1000)

species sStaphylococcus aureus 8367 sSalmonella enterica 7602 sEscherichia coli 6324 sStreptococcus pneumoniae 5513 sMycobacterium tuberculosis 5485 sKlebsiella pneumoniae 4428 sAcinetobacter baumannii 2519 sPseudomonas aeruginosa 1873 sStreptococcus pyogenes 1399 sListeria monocytogenes 1207 sMycobacterium abscessus 1139 sListeria monocytogenes_B 1097 sClostridioides difficile 958 sBurkholderia mallei 937 sNeisseria meningitidis 893 sStreptococcus suis 890 sWolbachia pipientis 869 sPseudomonas_E viridiflava 855 sVibrio cholerae 854 sEnterococcus_B faecium 851

top 20 genera with infinite unicity (k=31, scaled=1000)

genus gStreptococcus 9682 gStaphylococcus 9476 gSalmonella 7700 gMycobacterium 6875 gEscherichia 6383 gKlebsiella 5032 gAcinetobacter 2772 gListeria 2456 gPseudomonas 1900 gPseudomonas_E 1875 gVibrio 1794 gBurkholderia 1654 gNeisseria 1397 gCampylobacter_D 1389 g__Enterococcus_B 1178 gClostridioides 960 gBordetella 938 gFrancisella 906 gWolbachia 905 g__Enterococcus 787

Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?

ctb commented 2 years ago

One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.

My fault - It's an oddity of the way I'm doing the analysis and reporting. If the analysis reaches the end of the hashes without uniquely identifying the genome, it has infinite unicity distance. If it takes 50% of the hashes to get there, that number of hashes is the unicity distance. A unicity distance of 0 isn't possible, a unicity distance of 1 means a single hash has been found that uniquely identifies this genome.

Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?

Here we're dealing only with genomes - not species or genus. Species and genus are taxonomic terms and I haven't figured out how to calculate unicity for them (I'm trying out a different measure in #2).

So, yes, infinite unicity distance means that the genome does not have any single hash or combination of hashes in it that is unique to that genome. This would be a consequence of sourmash's hashing of course - clearly there is at least some difference between the genomes otherwise they wouldn't be different genomes - but sourmash isn't finding anything with k=31 and 1000.

And a unicity distance of 1 means that there is at least one individiual hash (there may be many individual hashes!) that can identify that genome uniquely.

Thanks for the questions!

ctb commented 2 years ago

(I'll improve the way I'm talking about this to be more correct ;)

bavinatzer commented 2 years ago

Perfect! I got it now! Thank you, Titus!

ctb / 2022-sourmash-sens-spec