ctb / 2022-sourmash-sens-spec

Playing around with sens/spec measurements for simulated stuff
BSD 3-Clause "New" or "Revised" License

longest run of taxonomically confused hashes per taxonomy #4

Open ctb opened 1 year ago

ctb commented 1 year ago

I haven't figured out what to call this, but the table below is an incomplete answer to the question:

what’s the largest collection of hashes present in a single genome that leaves you in doubt as to what taxonomic unit it comes from, per given taxon?

For example, from the table below:

I actually can't figure out which of its partners is in a different class than E. coli, so let me go to a different row to illustrate the partner aspect:

in this case I'd guess it's contamination, but some of the others in the table below might not be.

Anyway, enjoy!

| | overlap | lin | name |
|---|---|---|---|
| 0 | 364 | d__Bacteria | GCA_001894475.1 Escherichia coli strain=687, ASM189447v1 |
| 2 | 261 | d__Bacteria;p__Proteobacteria | GCF_005503355.1 Sphingomonas sp. 1F27F7B strain=1F27F7B, ASM550335v1 |
| 1 | 251 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria | GCF_003669905.1 Pseudomonas aeruginosa strain=Pa1810, ASM366990v1 |
| 4 | 159 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales | GCF_018929655.1 Vibrio cholerae O1 strain=11_Lusaka_2018, ASM1892965v1 |
| 7 | 119 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales | GCA_003472185.1 Parabacteroides merdae strain=AM14-15, ASM347218v1 |
| 14 | 103 | d__Bacteria;p__Firmicutes_A;c__Clostridia | GCA_900553485.1 uncultured Clostridium sp., UMGS1619 |
| 26 | 88 | d__Bacteria;p__Marinisomatota;c__Marinisomatia;o__Marinisomatales | GCA_018698165.1 Candidatus Marinimicrobia bacterium, ASM1869816v1 |
| 16 | 77 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia | GCF_005889725.1 Nonomuraea zeae strain=DSM 100528, ASM588972v1 |
| 11 | 70 | d__Bacteria;p__Firmicutes;c__Bacilli | GCF_905311015.1 Bacillus subtilis, NRS6094 |
| 15 | 70 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales | GCA_900766145.1 uncultured Oscillospiraceae bacterium, SRS295027_34 |
| 9 | 65 | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria | GCA_014359905.1 Hoeflea sp., ASM1435990v1 |
| 46 | 62 | d__Archaea;p__Halobacteriota;c__Halobacteria;o__Halobacteriales | GCA_005954745.1 Halostella pelagica strain=DL-M4, ASM595474v1 |
| 3 | 54 | | GCA_018658425.1 Candidatus Woesearchaeota archaeon, ASM1865842v1 |
| 6 | 49 | d__Bacteria;p__Bacteroidota;c__Bacteroidia | GCA_002256395.1 Bacteroidetes bacterium B1(2017), ASM225639v1 |
| 19 | 38 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales | GCF_001314305.1 Stenotrophomonas acidaminiphila strain=ZAC14D2_NAIMI4_2, ASM131430v1 |
| 8 | 37 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Burkholderiales | GCA_903833455.1 uncultured proteobacterium, freshwater MAG --- MJ120716B_bin-425 |
| 5 | 34 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales | GCA_002389265.1 Gammaproteobacteria bacterium UBA4475, ASM238926v1 |
| 24 | 34 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales | GCF_008764375.1 Bacillus safensis strain=DE0105, FS22 |
| 56 | 31 | d__Bacteria;p__Desulfobacterota | GCA_009993185.1 Deltaproteobacteria bacterium, ASM999318v1 |
| 40 | 31 | d__Bacteria;p__Bacteroidota | GCA_903878245.1 uncultured Bacteroidales bacterium, freshwater MAG --- Ja1_bin-1678 |
| 10 | 31 | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales | GCF_018129525.1 Bradyrhizobium denitrificans strain=SZCCT0094, ASM1812952v1 |
| 25 | 29 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales | GCA_012927515.1 Cellulomonas sp., ASM1292751v1 |
| 81 | 26 | d__Bacteria;p__Verrucomicrobiota;c__Kiritimatiellae;o__RFP12 | GCA_017509565.1 Kiritimatiellae bacterium, ASM1750956v1 |
| 17 | 24 | d__Bacteria;p__Actinobacteriota | GCF_015560095.1 Bifidobacterium adolescentis strain=1001270J_160509_E8, ASM1556009v1 |
| 58 | 21 | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodospirillales | GCA_018654935.1 Rhodospirillales bacterium, ASM1865493v1 |
| 29 | 21 | d__Bacteria;p__Verrucomicrobiota | GCA_903961625.1 uncultured Victivallales bacterium, freshwater MAG --- Loc090907-8-6m_bin-024 |
| 39 | 20 | d__Bacteria;p__Acidobacteriota;c__Acidobacteriae | GCA_003224475.1 Acidobacteria bacterium, ASM322447v1 |
| 82 | 19 | d__Bacteria;p__Chloroflexota;c__Dehalococcoidia | GCA_002720365.1 Chloroflexi bacterium, ASM272036v1 |
| 96 | 19 | d__Bacteria;p__Proteobacteria;c__Magnetococcia;o__Magnetococcales | GCA_015231925.1 Magnetococcales bacterium, ASM1523192v1 |
| 20 | 18 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales | GCA_902805565.1 uncultured Corynebacteriales bacterium, AVDCRST-MAG41 |
| 36 | 18 | d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales | GCA_900548495.1 uncultured Collinsella sp., UMGS1095 |
| 64 | 17 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Acetivibrionales | GCF_000015865.1 Hungateiclostridium thermocellum ATCC 27405 strain=ATCC 27405, ASM1586v1 |
| 28 | 17 | d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae | GCA_018667255.1 Opitutae bacterium, ASM1866725v1 |
| 86 | 16 | d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae;o__Pedosphaerales | GCA_016235585.1 Verrucomicrobia bacterium, ASM1623558v1 |
| 94 | 15 | d__Bacteria;p__Methylomirabilota;c__Methylomirabilia | GCA_016187735.1 candidate division NC10 bacterium, ASM1618773v1 |
| 87 | 14 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Paenibacillales | GCF_013337105.1 Paenibacillus sp. JMULE4 strain=JMULE4, ASM1333710v1 |
| 21 | 14 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales | GCA_900555595.1 uncultured Solobacterium sp., UMGS1844 |
| 83 | 14 | d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae;o__Opitutales | GCA_018667255.1 Opitutae bacterium, ASM1866725v1 |
| 97 | 13 | d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Cyanobacteriales | GCA_004294125.1 Oscillatoriales cyanobacterium, ASM429412v1 |
| 23 | 12 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Christensenellales | GCA_017394825.1 Clostridia bacterium, ASM1739482v1 |
| 32 | 12 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Streptomycetales | GCF_015356865.1 Catenulispora pinisilvae strain=NH11, ASM1535686v1 |
| 101 | 12 | d__Bacteria;p__Nitrospirota;c__Thermodesulfovibrionia;o__UBA6902 | GCA_011040095.1 Nitrospirae bacterium, ASM1104009v1 |
| 13 | 11 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales | GCF_018195745.1 Lactococcus lactis subsp. lactis strain=LEY6, ASM1819574v1 |
| 35 | 11 | d__Bacteria;p__Acidobacteriota | GCA_002328195.1 Acidobacteria bacterium UBA2167, ASM232819v1 |

ctb commented 1 year ago

maybe: "longest hash chain" at that taxon?
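
One way to pin down what "longest hash chain at a taxon" could mean is sketched below. This is a guess at the logic, not the code that produced the table: it assumes the `lin` column is the lowest common ancestor (LCA) of all genomes sharing a hash, and that `overlap` is the largest number of such hashes found in any single genome. The function names and the plain-dict inputs (`genome_hashes`, `genome_lineage`) are made up for illustration and are not sourmash APIs.

```python
from collections import defaultdict


def lca(lineages):
    """Return the longest shared prefix of a collection of lineage tuples."""
    lineages = list(lineages)
    prefix = lineages[0]
    for lin in lineages[1:]:
        n = 0
        for a, b in zip(prefix, lin):
            if a != b:
                break
            n += 1
        prefix = prefix[:n]
    return prefix


def longest_chain_per_taxon(genome_hashes, genome_lineage):
    """Map each confused LCA lineage to (genome ident, hash count).

    genome_hashes: dict of ident -> set of hashes (hypothetical input)
    genome_lineage: dict of ident -> lineage tuple (hypothetical input)
    """
    # Invert the index: which genomes contain each hash?
    hash_to_genomes = defaultdict(set)
    for ident, hashes in genome_hashes.items():
        for h in hashes:
            hash_to_genomes[h].add(ident)

    # Count, per (LCA, genome), the hashes whose owners disagree on lineage.
    counts = defaultdict(int)
    for h, idents in hash_to_genomes.items():
        lineages = {genome_lineage[i] for i in idents}
        if len(lineages) < 2:
            continue  # all owners share one lineage, so no confusion
        anc = lca(lineages)
        for ident in idents:
            counts[(anc, ident)] += 1

    # Keep the genome with the most confused hashes at each LCA.
    # Ties go to the ident that sorts first.
    best = {}
    for (anc, ident), n in sorted(counts.items()):
        if anc not in best or n > best[anc][1]:
            best[anc] = (ident, n)
    return best


# Toy data, not taken from the table above.
toy_hashes = {"A": {1, 2, 3, 4}, "B": {1, 2, 5}, "C": {3, 6}}
toy_lineage = {
    "A": ("d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria"),
    "B": ("d__Bacteria", "p__Bacteroidota", "c__Bacteroidia"),
    "C": ("d__Bacteria", "p__Proteobacteria", "c__Alphaproteobacteria"),
}
best = longest_chain_per_taxon(toy_hashes, toy_lineage)
```

In the toy data, hashes 1 and 2 are shared across phyla, so they land at `d__Bacteria`, while hash 3 is shared across classes within Proteobacteria and lands one rank lower. An empty tuple as the LCA would correspond to a row with no lineage at all, i.e. hashes shared across domains.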

ctb commented 1 year ago

ok, got a better way to do a breakdown of longest hash chain for specific taxa.

Per the table above, for d__Bacteria the longest hash chain is 364 hashes.

This hash chain is entirely part of GCA_001894475, d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli,

which shares it across 165 partners - breakdown of top 10 partners and overlap below.

| | partner_ident | partner_lin | n_hashes |
|---|---|---|---|
| 0 | GCF_001481655 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__Flavobacterium;s__Flavobacterium odoratimimum | 46 |
| 1 | GCF_012102505 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Vagococcaceae;g__Vagococcus;s__Vagococcus fluvialis | 20 |
| 2 | GCF_003039915 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus cohnii | 15 |
| 3 | GCF_009020275 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis | 9 |
| 4 | GCA_900758605 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides sp900552405 | 9 |
| 5 | GCF_013009155 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus suis_W | 8 |
| 6 | GCF_003311455 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus | 7 |
| 7 | GCF_007293315 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales_H;f__Bacillaceae_D;g__Alkalihalobacillus_A;s__Alkalihalobacillus_A sp007293315 | 7 |
| 8 | GCF_001865835 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__Flavobacterium;s__Flavobacterium odoratimimum | 6 |
| 9 | GCF_009648365 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis | 6 |
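
A partner breakdown like this could be computed roughly as follows. This is a minimal sketch under assumptions, not the script actually used here: `partner_breakdown` and its dict-of-sets input are hypothetical names, and the chain is assumed to be available as a plain set of hash values.

```python
from collections import Counter


def partner_breakdown(chain_hashes, query_ident, genome_hashes, top=10):
    """Return the top partners as (ident, number of chain hashes shared).

    chain_hashes: set of hashes in the query genome's chain
    genome_hashes: dict of ident -> set of hashes (hypothetical input)
    """
    counts = Counter()
    for ident, hashes in genome_hashes.items():
        if ident == query_ident:
            continue  # the query genome is not its own partner
        n = len(chain_hashes & hashes)
        if n:
            counts[ident] = n
    # most_common sorts by count, largest first
    return counts.most_common(top)


# Toy data, not taken from the table above.
toy = {"Q": {1, 2, 3, 4, 5}, "P1": {1, 2, 3, 9}, "P2": {3, 4}, "P3": {7}}
rows = partner_breakdown({1, 2, 3, 4}, "Q", toy)
```

Under this definition a single hash can be counted for several partners, so the per-partner counts need not sum to the chain length.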