Open ctb opened 2 years ago
THIS IS SO COOL.
Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.
Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?
THIS IS SO COOL.
:) it does seem to nicely wrap a bunch of concerns up into a single number!
Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.
yeah, exactly. working on it - I'm trying to figure out what a similar number might look like for a taxonomy.
Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?
unicity distance is formally defined as the minimum, so there is no shorter list of k-mers (...see the connection to min-set-cov? 😆 )
there may be multiple collections of k-mers that can be used, tho. maybe a way to phrase that question would be "what number of the hashes in the genome can uniquely identify the genome?" but this only makes sense in the case where the unicity distance is 1 anyway.
right now the easiest/simplest interpretations are when unicity is 1 or infinite - 1 breaks LCA-style approaches for genome identification, infinite breaks gather-style approaches for genome identification.
p.s. currently calculating estimated unicity for all genomes in GTDB rs207.
That's fascinating!
I see that GCF_001084125.1 is Neisseria gonorrhoeae, which is both an important and interesting case study for taxonomic distinction, and Neisseria are often considered challenging because they're highly recombinogenic, very closely-related to other Neisseria species, and tend to co-inhabit the same body space (the back of your nose, IIRC). I've used these when trying to convince people of the value of ANI in the past… some slides from an old talk…
There may be something more recent than this paper but it's a good place to start on the why of them being a challenging group.
very cool - I suspect that this unicity number is going to be a good, simple way to find all the challenging genomes 😆
Thanks for sharing this number for me! I am curious to see if there is a correlation between the unicity number and the genotyping accuracy!
First pass x GTDB rs207 - 15.3% of genomes have unicity distance of 1; 29.2% have an infinite unicity distance (k=31, scaled=1000).
Note that the 29.2% is not an estimate, but the 15.3% is a lower bound - there may be more genomes that have a unicity distance of 1.
notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb
species | |
---|---|
s__Staphylococcus aureus | 8367 |
s__Salmonella enterica | 7602 |
s__Escherichia coli | 6324 |
s__Streptococcus pneumoniae | 5513 |
s__Mycobacterium tuberculosis | 5485 |
s__Klebsiella pneumoniae | 4428 |
s__Acinetobacter baumannii | 2519 |
s__Pseudomonas aeruginosa | 1873 |
s__Streptococcus pyogenes | 1399 |
s__Listeria monocytogenes | 1207 |
s__Mycobacterium abscessus | 1139 |
s__Listeria monocytogenes_B | 1097 |
s__Clostridioides difficile | 958 |
s__Burkholderia mallei | 937 |
s__Neisseria meningitidis | 893 |
s__Streptococcus suis | 890 |
s__Wolbachia pipientis | 869 |
s__Pseudomonas_E viridiflava | 855 |
s__Vibrio cholerae | 854 |
s__Enterococcus_B faecium | 851 |
genus | |
---|---|
g__Streptococcus | 9682 |
g__Staphylococcus | 9476 |
g__Salmonella | 7700 |
g__Mycobacterium | 6875 |
g__Escherichia | 6383 |
g__Klebsiella | 5032 |
g__Acinetobacter | 2772 |
g__Listeria | 2456 |
g__Pseudomonas | 1900 |
g__Pseudomonas_E | 1875 |
g__Vibrio | 1794 |
g__Burkholderia | 1654 |
g__Neisseria | 1397 |
g__Campylobacter_D | 1389 |
g__Enterococcus_B | 1178 |
g__Clostridioides | 960 |
g__Bordetella | 938 |
g__Francisella | 906 |
g__Wolbachia | 905 |
g__Enterococcus | 787 |
oooh very cool. I think this shows either species (or genera) with a ton of genomes in GenBank or GTDB (like e. coli, p. arg, etc.) or genomes with high swappiness (Neisseria, Wolbachia)
Q: How many sourmash hashes does it take to classify a genome uniquely?
If the answer is 1, then there is a single hash that identifies this specific genome! 🎉
If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.
This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!
Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.
GCF_000814905.1 1 434 0.2% GCA_007116955.1 1 247 0.4% GCF_017948435.1 1 608 0.2% GCA_017995835.1 1 317 0.3% GCF_001981135.1 5 723 0.7% GCF_900111215.1 1 223 0.4% GCF_001084125.1 153 204 75.0% GCF_002224665.1 1 506 0.2% GCA_900758335.1 1 363 0.3% GCF_008326505.1 1 202 0.5% GCF_001204095.1 1 266 0.4% GCA_018665005.1 1 231 0.4% GCA_900478005.2 108 405 26.7% GCF_003812625.1 28 525 5.3% GCF_002088895.1 3 314 1.0% GCA_002397945.1 1 122 0.8% GCA_004116405.1 2 517 0.4% GCF_003117635.1 2 493 0.4% GCA_902789845.1 1 210 0.5% GCF_016628485.1 1 255 0.4%
One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.
notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb
top 20 species with infinite unicity (k=31, scaled=1000)
species sStaphylococcus aureus 8367 sSalmonella enterica 7602 sEscherichia coli 6324 sStreptococcus pneumoniae 5513 sMycobacterium tuberculosis 5485 sKlebsiella pneumoniae 4428 sAcinetobacter baumannii 2519 sPseudomonas aeruginosa 1873 sStreptococcus pyogenes 1399 sListeria monocytogenes 1207 sMycobacterium abscessus 1139 sListeria monocytogenes_B 1097 sClostridioides difficile 958 sBurkholderia mallei 937 sNeisseria meningitidis 893 sStreptococcus suis 890 sWolbachia pipientis 869 sPseudomonas_E viridiflava 855 sVibrio cholerae 854 sEnterococcus_B faecium 851
top 20 genera with infinite unicity (k=31, scaled=1000)
genus gStreptococcus 9682 gStaphylococcus 9476 gSalmonella 7700 gMycobacterium 6875 gEscherichia 6383 gKlebsiella 5032 gAcinetobacter 2772 gListeria 2456 gPseudomonas 1900 gPseudomonas_E 1875 gVibrio 1794 gBurkholderia 1654 gNeisseria 1397 gCampylobacter_D 1389 g__Enterococcus_B 1178 gClostridioides 960 gBordetella 938 gFrancisella 906 gWolbachia 905 g__Enterococcus 787
Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?
One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.
My fault - It's an oddity of the way I'm doing the analysis and reporting. If the analysis reaches the end of the hashes without uniquely identifying the genome, it has infinite unicity distance. If it takes 50% of the hashes to get there, that number of hashes is the unicity distance. A unicity distance of 0 isn't possible, a unicity distance of 1 means a single hash has been found that uniquely identifies this genome.
Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?
Here we're dealing only with genomes - not species or genus. Species and genus are taxonomic terms and I haven't figured out how to calculate unicity for them (I'm trying out a different measure in #2).
So, yes, infinite unicity distance means that the genome does not have any single hash or combination of hashes in it that is unique to that genome. This would be a consequence of sourmash's hashing of course - clearly there is at least some difference between the genomes otherwise they wouldn't be different genomes - but sourmash isn't finding anything with k=31 and 1000.
And a unicity distance of 1 means that there is at least one individiual hash (there may be many individual hashes!) that can identify that genome uniquely.
Thanks for the questions!
(I'll improve the way I'm talking about this to be more correct ;)
Perfect! I got it now! Thank you, Titus!
Q: How many sourmash hashes does it take to classify a genome uniquely?
If the answer is 1, then there is a single hash that identifies this specific genome! :tada:
If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.
This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!
Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.