DominikBuchner / BOLDigger3

MIT License

Actual thresholds differ from specified by 1%pt. #16

Closed: hjarnek closed this issue 4 days ago

hjarnek commented 5 days ago

Many of my classifications are either overclassified or (more rarely) underclassified relative to the given thresholds. I guess the underclassified cases are due to incomplete taxonomic information in the matched reference sequences (e.g. >97% similarity, but only classified to genus).

The overclassified taxa are harder to explain. With the default thresholds (97%: species level, 95%: genus level, 90%: family level, 85%: order level), I get over 200 overclassifications at the species, genus, and family levels, i.e. cases where at least the last available level of taxonomic information should have been discarded. As far as I can tell, the % identity in these cases is within one percentage point below the threshold that should have been applied, i.e. >96%, >94%, or >89%.
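To make the expected behaviour concrete, here is a minimal Python sketch of the threshold rule described above. It is not BOLDigger3's code: the function name `deepest_allowed_level` is hypothetical, and treating each threshold as inclusive (`>=`) is an assumption, since the thread only lists the percentages.

```python
# Default thresholds quoted in this issue, ordered from most to least specific.
THRESHOLDS = [
    (97.0, "Species"),
    (95.0, "Genus"),
    (90.0, "Family"),
    (85.0, "Order"),
]


def deepest_allowed_level(pct_identity):
    """Return the most specific level whose threshold pct_identity meets.

    Returns None below the lowest threshold; this thread does not say what
    should happen there, so the caller has to decide.
    """
    for minimum, level in THRESHOLDS:
        # Assumption: thresholds are inclusive (97.0 still counts as species).
        if pct_identity >= minimum:
            return level
    return None


# pct_identity values taken from the example output in this issue.
for pct in (96.907, 94.0, 89.919):
    print(pct, "->", deepest_allowed_level(pct))
```

Under these assumptions, a hit at 96.907% should stop at genus, one at 94% at family, and one at 89.919% at order, which is one level above what the reported output contained.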

I wrote an R script to correct this, but it would be good if you could have a look at it as well. Here is some example output:

```
Species information dropped for ASV46169 (pct_identity = 96.907)
Genus information dropped for ASV46218 (pct_identity = 94.757)
Genus information dropped for ASV46347 (pct_identity = 94.139)
Genus information dropped for ASV46394 (pct_identity = 94.463)
Genus information dropped for ASV46444 (pct_identity = 94.961)
Species information dropped for ASV46634 (pct_identity = 96.525)
Genus information dropped for ASV46906 (pct_identity = 94)
Species information dropped for ASV46915 (pct_identity = 96.863)
Species information dropped for ASV47079 (pct_identity = 96.099)
Species information dropped for ASV47212 (pct_identity = 96.8)
Genus information dropped for ASV47244 (pct_identity = 94)
Species information dropped for ASV47248 (pct_identity = 96.233)
Genus information dropped for ASV47507 (pct_identity = 94.074)
Species information dropped for ASV47533 (pct_identity = 96.667)
Species information dropped for ASV47747 (pct_identity = 96.831)
Family information dropped for ASV47933 (pct_identity = 89.919)
Species information dropped for ASV47966 (pct_identity = 96.269)
Species information dropped for ASV48534 (pct_identity = 96.813)
Species information dropped for ASV48558 (pct_identity = 96.416)
Species information dropped for ASV48584 (pct_identity = 96.078)
```
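The R script itself is not shown in this thread. As a rough illustration of the kind of correction it describes, the Python sketch below blanks every taxonomic level that the `pct_identity` of a record does not support and emits messages in the same format as the output above. The function name `drop_overclassified`, the dict-based record layout, and the inclusive thresholds are all assumptions made for this example.

```python
# Column order from the result table, least to most specific.
LEVELS = ["Phylum", "Class", "Order", "Family", "Genus", "Species"]
# Default thresholds quoted in this issue (assumed inclusive).
MIN_IDENTITY = {"Species": 97.0, "Genus": 95.0, "Family": 90.0, "Order": 85.0}


def drop_overclassified(record):
    """Return (corrected_copy, messages) for one result row given as a dict."""
    fixed = dict(record)
    messages = []
    # Walk from the most specific level upwards.
    for level in reversed(LEVELS):
        minimum = MIN_IDENTITY.get(level)
        if minimum is None:
            break  # Phylum and Class have no threshold in this sketch
        if fixed.get(level) and fixed["pct_identity"] < minimum:
            fixed[level] = ""
            messages.append(
                f"{level} information dropped for {fixed['id']} "
                f"(pct_identity = {fixed['pct_identity']})"
            )
    return fixed, messages


# Row ASV46915 as reported in this issue.
row = {
    "id": "ASV46915", "Phylum": "Mollusca", "Class": "Gastropoda",
    "Order": "Nudibranchia", "Family": "Eubranchidae", "Genus": "Eubranchus",
    "Species": "Eubranchus exiguus", "pct_identity": 96.863,
}
fixed, messages = drop_overclassified(row)
print("\n".join(messages))
```

Note that a workaround like this only trims the reported taxonomy; it cannot recover a better hit that was never selected.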
DominikBuchner commented 5 days ago

Can you please upload the underlying data, so I can replicate this without having to download?

Edit: I believe this is because BOLD does not always send back the results sorted by similarity. That is another bug I still have to report. I will do the sorting in BOLDigger instead.


hjarnek commented 5 days ago

Alright, I see.

Here are some data for reference:

>ASV40920
TTTAAGAAGATCTATCGCCCATAGAGGAGGAGCTGTGGACCTTGCTATTTTTTCACTACACCTGGCAGGTGCTTCTTCTATCTTAGGGGCTATTAATTTTATTTCTACTGTAATTAATATACGATCCACGAATATATATATAAGACGAGTGCCTTTATTTGTTTGATCAGTCTTTATCACTGCTATTTTACTACTTTTATCTCTCCCTGTCTTAGCAGGAGCTATTACTATACCTCTTACAGATCGAAATTTAAACACATCATTCTTTGACCCTACAGGGGGAGGAGACCCTATTTTATACCAACACCTATTT
>ASV45297
ATTATCAGGCCCACAAATGCATTCAGGTGGGTCAGTAGATATGGCTATATTTAGTTTACATTGTGCTGGTGCTTCCTCAACTATGGGTGCTATAAATTTTATAACAACTATAATTAATATGAGAGCCCCAGGAATGACTTTTGATAAGTTACCTTTATTTGTATGATCAGTACTTATAACTGCATTCTTACTTCTTCTTTCTTTACCAGTATTAGCAGGTGCTATAACTATGCTCTTAACAGATCGTAATTTTAACACTACATTTTTTGATCCGGCAGGTGGGGGTGATCCAGTATTATACCAACACTTATTC
>ASV45472
GCTGAGCTCTACCTTGGCACACAGGGGTGGCGCAGTAGATTTAGCAATTTTTTCTCTACATTTGGCAGGTGCTTCTTCAATTTTAGGGGCTATCAATTTTATTTCTACTGTTATTAATATACGAGCCAAAGGTATATATATAGAACGTGTATCTCTTTTTGTTTGATCCGTATTTATTACCGCTATTTTACTACTTCTCTCATTGCCTGTTTTAGCTGAGGCTATTACTATACTTCTCACTGACCGAAATTTTAACACGACTTTTTTTGACCCCGTGGGGGGAGGGGATCCAATTTTATACCAGCATCTTTTC
>ASV45553
ACTTTCTAGCAATCTTGCTCATGCGGGAGGATCTGTAGACTTAGCTATTTTTTCTTTACATTTAGCAGGTGTTTCTTCTATTCTTGGAGCCGTAAACTTTATTACAACTATTATCAATATACGATGACGAGGAATGCAGTTCGAACGGCTTCCGTTGTTTGTTTGATCTGTAAAAATTACTGCCATTTTATTATTATTGTCACTACCTGTCTTGGCAGGTGCGATTACCATACTTTTAACGGATCGAAATATCAATACATCTTTTTTTGACCCTTTAGGAGGGGGAGACCCTATCCTATACCAACATTTATTT
>ASV46915
TCTTTCTGGTCCCATGGGCCACGGGGGTTGTTCTGTGGACCTCGCAATTTTTTCCCTCCATTTAGCAGGTATGTCTTCTTTACTAGGGGCTATTAATTTTATTACGACTATTTTCAATATGCGGTCTCCCGAGATAACTTGAGATCGGATAAGATTATTTGTTTGATCTGTTCTAGTGACAGCATTTCTATTGCTTTTATCTCTTCCTGTGTTGGCTGGGGCTATTACTATGCTACTAACCGACCGTAACTTTAACACCTCGTTCTTTGACCCTGCTGGTGGTGGAGACCCTGTTCTTTACCAACACCTGTTC
>ASV47244
TTTATCAGGTTCTCAAACACATTCAGGAGGAGCAGTAGATATGGCTATTTTTAGTTTACATTGTGCAGGAGCTTCTTCTATTATGGGAGCTATAAATTTTATAACTACCATATTTAATATGAGAGCCCCTGGATTAACACTAGATAAATTACCTTTATTTGTCTGATCTGTATTAATCACTGCTTTCTTATTATTATTATCTCTACCTGTATTAGCAGGAGCCATAACAATGTTATTAATCGACAGAAATTTTAATACTACATTTTTTGATCCTGCAGGAGGAGGAGATCCGGTACTATATCAACATTTATTT
>ASV47933
TTTATCAGGTTCTCAAACACATTCAGGAGGAGCAGTAGATATGGCTATTTTTAGTTTACATTGTGCAGGAGCTTCTTCTATTATGGGAGCTATAAATTTTATAACTACCATATTTAATATGAGAGCTATTGGTTTATACATGCATAGATTACCTTTATTTGTTTGGGCTGTTTTAATAACAGCAGTTTTATTGTTATTATCTTTACCTGTATTAGCTGGGGCTATAACTATGTTGTTAACTGACAGAGCTTTCGGTACATTATTTTATAACAGTGCTGGGGGTGGTGATCCTGTATTATATCAACATTTATTT

| id | Phylum | Class | Order | Family | Genus | Species | pct_identity | status | records | selected_level | BIN | flags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ASV40920 | Arthropoda | Malacostraca | Amphipoda | Corophiidae | Monocorophium | Monocorophium insidiosum | 96 | public | 1 | Species | BOLD:AAE1628;BOLD:AAE9749;BOLD:AFS1096 | 5 |
| ASV45297 | Cnidaria | Hydrozoa | Anthoathecata | Pandeidae | Leuckartiara | Leuckartiara octona | 96.907 | private | 3 | Species | BOLD:ACQ8293 | |
| ASV45472 | Arthropoda | Malacostraca | Amphipoda | Aoridae | Microdeutopus | | 94.158 | public | 1 | Genus | | |
| ASV45553 | Mollusca | Gastropoda | Littorinimorpha | Hydrobiidae | Hydrobia | Hydrobia ulvae | 96.667 | public | 1 | Species | BOLD:AAA7911 | |
| ASV46915 | Mollusca | Gastropoda | Nudibranchia | Eubranchidae | Eubranchus | Eubranchus exiguus | 96.863 | public | 1 | Species | BOLD:AAE1900;BOLD:ADZ4129 | 5 |
| ASV47244 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | Obelia | | 94 | private | 3 | Genus | | |
| ASV47933 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | | | 89.919 | public | 1 | Family | | |
DominikBuchner commented 4 days ago

Hi there @hjarnek, can you quickly validate that this is the expected output for your example:

| id | Phylum | Class | Order | Family | Genus | Species | pct_identity | status | records | selected_level | BIN | flags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ASV40920 | Arthropoda | Malacostraca | Amphipoda | Corophiidae | Monocorophium | Monocorophium insidiosum | 98.515 | public | 1 | Species | BOLD:AAE9749;BOLD:AAE1628;BOLD:AFS1096 | 5 |
| ASV45297 | Cnidaria | Hydrozoa | Anthoathecata | Pandeidae | Leuckartiara | Leuckartiara octona | 98.773 | public | 3 | Species | BOLD:ACQ8293 | |
| ASV45472 | Arthropoda | Malacostraca | Amphipoda | Aoridae | Microdeutopus | | 95.604 | public | 1 | Genus | | |
| ASV45553 | Mollusca | Gastropoda | Littorinimorpha | Hydrobiidae | Hydrobia | Hydrobia ulvae | 98.394 | public | 1 | Species | BOLD:AAA7911 | |
| ASV46915 | Mollusca | Gastropoda | Nudibranchia | Eubranchidae | Eubranchus | Eubranchus exiguus | 99.408 | public | 1 | Species | BOLD:AAE1900;BOLD:ADZ4129 | 5 |
| ASV47244 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | Obelia | | 95.745 | private | 3 | Genus | | |
| ASV47933 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | | | 90.213 | private | 1 | Family | | |

DominikBuchner commented 4 days ago

I believe the dropping of information was not implemented incorrectly; rather, the sorting of the hits for each ID is what caused the problems here.
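A minimal Python sketch of the idea, sorting the hits of each ID by similarity before picking the top hit, could look like the following. This is not the code that went into BOLDigger3: the function name `top_hits`, the dict-based hit layout, and the sample values are hypothetical and only illustrate why relying on the order returned by BOLD is unsafe.

```python
from collections import defaultdict


def top_hits(hits):
    """Pick the hit with the highest pct_identity for each query id."""
    by_id = defaultdict(list)
    for hit in hits:
        by_id[hit["id"]].append(hit)

    best = {}
    for query_id, group in by_id.items():
        # Do not trust the order in which the hits arrived: sort explicitly.
        ranked = sorted(group, key=lambda h: h["pct_identity"], reverse=True)
        best[query_id] = ranked[0]
    return best


# Hypothetical, unsorted hit list: the best match is not the first row.
hits = [
    {"id": "query_1", "Species": "Species a", "pct_identity": 96.0},
    {"id": "query_1", "Species": "Species a", "pct_identity": 98.5},
    {"id": "query_2", "Species": "Species b", "pct_identity": 99.1},
]
for query_id, hit in top_hits(hits).items():
    print(query_id, hit["pct_identity"])
```

Taking the first row per ID without sorting would report 96.0 for `query_1` here, even though a 98.5 hit exists, which is the kind of mismatch between reported similarity and assigned level seen in this issue.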

DominikBuchner commented 4 days ago

Fixed in 1.2.2. You can repair your output by running boldigger again on the original FASTA file. All steps except the selection of top hits will be skipped.

hjarnek commented 3 days ago

Yes, can confirm I get the same output as you, as expected. Thanks for the quick fix!