Closed hjarnek closed 4 days ago
Can you please upload the underlying data, so I can replicate this without having to download?
Edit: I believe this is because BOLD does not return the results sorted by similarity in all cases. That is another bug I still have to report. I will do the sorting in BOLDigger then.
Joel Hjärne Kokk wrote on Tue, 12 Nov 2024, 22:50:
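The sorting step mentioned here could look roughly like the sketch below. It is not BOLDigger's actual code: the `sort_hits` helper and the list-of-dicts layout are assumptions made for illustration, with `pct_identity` borrowed from the column name in the output tables. The two identity values are the ones reported for ASV46915 in this thread; the order in which they arrive is assumed.

```python
def sort_hits(hits):
    """Return hits ordered by pct_identity, best match first.

    Hypothetical helper: `hits` is assumed to be a list of dicts that
    each carry a numeric "pct_identity" key. sorted() is stable, so hits
    with equal identity keep the order in which they were received.
    """
    return sorted(hits, key=lambda hit: hit["pct_identity"], reverse=True)


# Illustrative input: the weaker match happens to come first.
unsorted_hits = [
    {"Species": "Eubranchus exiguus", "pct_identity": 96.863},
    {"Species": "Eubranchus exiguus", "pct_identity": 99.408},
]

best_hit = sort_hits(unsorted_hits)[0]
print(best_hit["pct_identity"])
```

Taking the first element only after sorting means the chosen hit no longer depends on the order of the response.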
I get a lot of classifications that are either overclassified or (more rarely) underclassified according to the given thresholds. I guess the underclassified cases are due to incomplete taxonomic information in the matched reference sequences (e.g. >97% similarity, but only classified to genus).
The overclassified taxa are harder to explain. With the default thresholds (97%: species level, 95%: genus level, 90%: family level, 85%: order level), I get over 200 overclassifications at species, genus, and family level, i.e. cases where at least the lowest available level of taxonomic information should have been discarded. As far as I can tell, the % identity in these cases is less than one percentage point below the threshold that should have been applied, i.e. >96%, >94%, or >89%.
I wrote an R script to correct this, but it would be good if you could have a look at it as well. Here is some example output:
Species information dropped for ASV46169 (pct_identity = 96.907)
Genus information dropped for ASV46218 (pct_identity = 94.757)
Genus information dropped for ASV46347 (pct_identity = 94.139)
Genus information dropped for ASV46394 (pct_identity = 94.463)
Genus information dropped for ASV46444 (pct_identity = 94.961)
Species information dropped for ASV46634 (pct_identity = 96.525)
Genus information dropped for ASV46906 (pct_identity = 94)
Species information dropped for ASV46915 (pct_identity = 96.863)
Species information dropped for ASV47079 (pct_identity = 96.099)
Species information dropped for ASV47212 (pct_identity = 96.8)
Genus information dropped for ASV47244 (pct_identity = 94)
Species information dropped for ASV47248 (pct_identity = 96.233)
Genus information dropped for ASV47507 (pct_identity = 94.074)
Species information dropped for ASV47533 (pct_identity = 96.667)
Species information dropped for ASV47747 (pct_identity = 96.831)
Family information dropped for ASV47933 (pct_identity = 89.919)
Species information dropped for ASV47966 (pct_identity = 96.269)
Species information dropped for ASV48534 (pct_identity = 96.813)
Species information dropped for ASV48558 (pct_identity = 96.416)
Species information dropped for ASV48584 (pct_identity = 96.078)
View it on GitHub: https://github.com/DominikBuchner/BOLDigger3/issues/16
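The correction described in the quoted report can be sketched as follows. This is neither the R script mentioned there nor BOLDigger's implementation, only a minimal illustration of applying the default thresholds. Assumptions: the thresholds are inclusive (`>=`), hits below 85% are kept down to class level, and the names `allowed_level` and `drop_overclassified` are hypothetical.

```python
# Default thresholds quoted in the report: minimum % identity per level.
THRESHOLDS = [("Species", 97.0), ("Genus", 95.0), ("Family", 90.0), ("Order", 85.0)]
# Taxonomic levels from highest to lowest, matching the output columns.
LEVELS = ["Phylum", "Class", "Order", "Family", "Genus", "Species"]


def allowed_level(pct_identity):
    """Return the lowest taxonomic level that the identity supports."""
    for level, minimum in THRESHOLDS:
        if pct_identity >= minimum:  # inclusive comparison is an assumption
            return level
    return "Class"  # assumption: below 85% only class level is retained


def drop_overclassified(hit):
    """Return a copy of `hit` with every level below the allowed one blanked."""
    cutoff = LEVELS.index(allowed_level(hit["pct_identity"]))
    fixed = dict(hit)
    for level in LEVELS[cutoff + 1:]:
        fixed[level] = ""
    return fixed


# Row for ASV46915 as reported in this thread (96.863% identity).
hit = {
    "Phylum": "Mollusca", "Class": "Gastropoda", "Order": "Nudibranchia",
    "Family": "Eubranchidae", "Genus": "Eubranchus",
    "Species": "Eubranchus exiguus", "pct_identity": 96.863,
}
fixed = drop_overclassified(hit)
print(allowed_level(hit["pct_identity"]), repr(fixed["Species"]))
```

With these thresholds, 96.863% supports genus level only, so the species name is blanked while the genus is kept, in line with the "Species information dropped for ASV46915" entry in the example output.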
Alright, I see.
Here are some data for reference:
>ASV40920
TTTAAGAAGATCTATCGCCCATAGAGGAGGAGCTGTGGACCTTGCTATTTTTTCACTACACCTGGCAGGTGCTTCTTCTATCTTAGGGGCTATTAATTTTATTTCTACTGTAATTAATATACGATCCACGAATATATATATAAGACGAGTGCCTTTATTTGTTTGATCAGTCTTTATCACTGCTATTTTACTACTTTTATCTCTCCCTGTCTTAGCAGGAGCTATTACTATACCTCTTACAGATCGAAATTTAAACACATCATTCTTTGACCCTACAGGGGGAGGAGACCCTATTTTATACCAACACCTATTT
>ASV45297
ATTATCAGGCCCACAAATGCATTCAGGTGGGTCAGTAGATATGGCTATATTTAGTTTACATTGTGCTGGTGCTTCCTCAACTATGGGTGCTATAAATTTTATAACAACTATAATTAATATGAGAGCCCCAGGAATGACTTTTGATAAGTTACCTTTATTTGTATGATCAGTACTTATAACTGCATTCTTACTTCTTCTTTCTTTACCAGTATTAGCAGGTGCTATAACTATGCTCTTAACAGATCGTAATTTTAACACTACATTTTTTGATCCGGCAGGTGGGGGTGATCCAGTATTATACCAACACTTATTC
>ASV45472
GCTGAGCTCTACCTTGGCACACAGGGGTGGCGCAGTAGATTTAGCAATTTTTTCTCTACATTTGGCAGGTGCTTCTTCAATTTTAGGGGCTATCAATTTTATTTCTACTGTTATTAATATACGAGCCAAAGGTATATATATAGAACGTGTATCTCTTTTTGTTTGATCCGTATTTATTACCGCTATTTTACTACTTCTCTCATTGCCTGTTTTAGCTGAGGCTATTACTATACTTCTCACTGACCGAAATTTTAACACGACTTTTTTTGACCCCGTGGGGGGAGGGGATCCAATTTTATACCAGCATCTTTTC
>ASV45553
ACTTTCTAGCAATCTTGCTCATGCGGGAGGATCTGTAGACTTAGCTATTTTTTCTTTACATTTAGCAGGTGTTTCTTCTATTCTTGGAGCCGTAAACTTTATTACAACTATTATCAATATACGATGACGAGGAATGCAGTTCGAACGGCTTCCGTTGTTTGTTTGATCTGTAAAAATTACTGCCATTTTATTATTATTGTCACTACCTGTCTTGGCAGGTGCGATTACCATACTTTTAACGGATCGAAATATCAATACATCTTTTTTTGACCCTTTAGGAGGGGGAGACCCTATCCTATACCAACATTTATTT
>ASV46915
TCTTTCTGGTCCCATGGGCCACGGGGGTTGTTCTGTGGACCTCGCAATTTTTTCCCTCCATTTAGCAGGTATGTCTTCTTTACTAGGGGCTATTAATTTTATTACGACTATTTTCAATATGCGGTCTCCCGAGATAACTTGAGATCGGATAAGATTATTTGTTTGATCTGTTCTAGTGACAGCATTTCTATTGCTTTTATCTCTTCCTGTGTTGGCTGGGGCTATTACTATGCTACTAACCGACCGTAACTTTAACACCTCGTTCTTTGACCCTGCTGGTGGTGGAGACCCTGTTCTTTACCAACACCTGTTC
>ASV47244
TTTATCAGGTTCTCAAACACATTCAGGAGGAGCAGTAGATATGGCTATTTTTAGTTTACATTGTGCAGGAGCTTCTTCTATTATGGGAGCTATAAATTTTATAACTACCATATTTAATATGAGAGCCCCTGGATTAACACTAGATAAATTACCTTTATTTGTCTGATCTGTATTAATCACTGCTTTCTTATTATTATTATCTCTACCTGTATTAGCAGGAGCCATAACAATGTTATTAATCGACAGAAATTTTAATACTACATTTTTTGATCCTGCAGGAGGAGGAGATCCGGTACTATATCAACATTTATTT
>ASV47933
TTTATCAGGTTCTCAAACACATTCAGGAGGAGCAGTAGATATGGCTATTTTTAGTTTACATTGTGCAGGAGCTTCTTCTATTATGGGAGCTATAAATTTTATAACTACCATATTTAATATGAGAGCTATTGGTTTATACATGCATAGATTACCTTTATTTGTTTGGGCTGTTTTAATAACAGCAGTTTTATTGTTATTATCTTTACCTGTATTAGCTGGGGCTATAACTATGTTGTTAACTGACAGAGCTTTCGGTACATTATTTTATAACAGTGCTGGGGGTGGTGATCCTGTATTATATCAACATTTATTT
| id | Phylum | Class | Order | Family | Genus | Species | pct_identity | status | records | selected_level | BIN | flags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASV40920 | Arthropoda | Malacostraca | Amphipoda | Corophiidae | Monocorophium | Monocorophium insidiosum | 96 | public | 1 | Species | BOLD:AAE1628;BOLD:AAE9749;BOLD:AFS1096 | 5 |
| ASV45297 | Cnidaria | Hydrozoa | Anthoathecata | Pandeidae | Leuckartiara | Leuckartiara octona | 96.907 | private | 3 | Species | BOLD:ACQ8293 | |
| ASV45472 | Arthropoda | Malacostraca | Amphipoda | Aoridae | Microdeutopus | | 94.158 | public | 1 | Genus | | |
| ASV45553 | Mollusca | Gastropoda | Littorinimorpha | Hydrobiidae | Hydrobia | Hydrobia ulvae | 96.667 | public | 1 | Species | BOLD:AAA7911 | |
| ASV46915 | Mollusca | Gastropoda | Nudibranchia | Eubranchidae | Eubranchus | Eubranchus exiguus | 96.863 | public | 1 | Species | BOLD:AAE1900;BOLD:ADZ4129 | 5 |
| ASV47244 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | Obelia | | 94 | private | 3 | Genus | | |
| ASV47933 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | | | 89.919 | public | 1 | Family | | |
Hi there @hjarnek, can you quickly validate that this is the expected output for your example:

| id | Phylum | Class | Order | Family | Genus | Species | pct_identity | status | records | selected_level | BIN | flags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASV40920 | Arthropoda | Malacostraca | Amphipoda | Corophiidae | Monocorophium | Monocorophium insidiosum | 98.515 | public | 1 | Species | BOLD:AAE9749;BOLD:AAE1628;BOLD:AFS1096 | 5 |
| ASV45297 | Cnidaria | Hydrozoa | Anthoathecata | Pandeidae | Leuckartiara | Leuckartiara octona | 98.773 | public | 3 | Species | BOLD:ACQ8293 | |
| ASV45472 | Arthropoda | Malacostraca | Amphipoda | Aoridae | Microdeutopus | | 95.604 | public | 1 | Genus | | |
| ASV45553 | Mollusca | Gastropoda | Littorinimorpha | Hydrobiidae | Hydrobia | Hydrobia ulvae | 98.394 | public | 1 | Species | BOLD:AAA7911 | |
| ASV46915 | Mollusca | Gastropoda | Nudibranchia | Eubranchidae | Eubranchus | Eubranchus exiguus | 99.408 | public | 1 | Species | BOLD:AAE1900;BOLD:ADZ4129 | 5 |
| ASV47244 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | Obelia | | 95.745 | private | 3 | Genus | | |
| ASV47933 | Cnidaria | Hydrozoa | Leptothecata | Campanulariidae | | | 90.213 | private | 1 | Family | | |