DominikBuchner / BOLDigger

A python package to query different databases of boldsystems.org
MIT License

Discrepancy between BOLDigger output and BOLD's identification engine #27

Closed naurasd closed 1 year ago

naurasd commented 1 year ago

Hi,

I came across something weird by chance while going through my BOLDigger output file.

I have ~8,000 COI metabarcoding sequences which I classified with BOLDigger. I was using boldigger-cline v2.1.2 at the time, in July 2023. I opened the issue here because I don't think it is specific to the command-line tool.

Below are the top 20 hits for ASV17731:

[Screenshot: top 20 BOLDigger hits for ASV17731]

When I manually check this sequence on BOLD's website against the All Barcode Records on BOLD database, I get the following nearest matches:

[Screenshot: nearest matches from the BOLD identification engine]

What bothers me is that these 20 matches in BOLD and BOLDigger are almost identical. But BOLDigger says this ASV has 87.38% similarity with the insect family Chironomidae, while on BOLD, this similarity value is attributed to a taxon of Ochrophyta.

I checked this now, in August 2023, so a month after the initial classification. But I honestly don't think the time gap has anything to do with it. Or am I wrong?

How can this sequence - according to BOLDigger - have the exact same similarity value for an insect family as well as an alga, while the former is not even listed in the output when I manually consult the BOLD identification engine?

This is the sequence in case you would like to reproduce the problem:

ASV17731 ATTATCATCTATTCAAGCGCATTCAGGGCCTTCAGTAGATATGGCGATTTTTAGTTTACATTTATCAGGTGCAGGTTCTATTTTAGGAGCAATTAATTTTATTGTAACTATCTTTAACATGCGTGCCCCAGGACTTTTCTTACATAAAATGCCTCTTTTTGTATGATCTGTATTAGTAACTGCATTTTTACTTTTATTATCTTTACCAGTTTTCGCTGGAGCAATTACTATGCTTTTAACAGATCGTAACTTTAATACAAGCTTTTATGATCCTGCCGGAGGAGGAGATCCAGTATTATACCAACATCTTTTC

Cheers

nauras

DominikBuchner commented 1 year ago

Can you please check if this issue persists once you have figured out your versioning problems? It might be solved by one of the more recent versions!

naurasd commented 1 year ago

Yes, just had the same thought. I most likely didn't use v2.2.0 after all, but 1.0.0.
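For anyone unsure which version is actually installed in the active environment, a minimal sketch using only the standard library is shown here. It assumes the packages are installed under the distribution names `boldigger-cline` and `boldigger`; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match your install.
    for name in ("boldigger-cline", "boldigger"):
        print(name, installed_version(name) or "not installed")
```

Running this inside the same environment that is used for the classification avoids mixing up versions between environments.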

naurasd commented 1 year ago

Update: after updating boldigger-cline to v2.2.1, this issue has been solved.

I just tested this with a fasta file of 10 sequences containing the ASV in question. The top 20 hits now match those from a manual identification on BOLD.
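A check like this can also be done programmatically. The sketch below is not part of BOLDigger; it assumes both hit lists have already been loaded by hand as ranked lists of `(taxon, similarity)` tuples, and the function name `compare_hits` and the example values are purely illustrative.

```python
from itertools import zip_longest


def compare_hits(boldigger_hits, bold_hits):
    """Compare two ranked hit lists position by position.

    Each list holds (taxon, similarity) tuples. Returns a list of
    (rank, boldigger_hit, bold_hit) for every rank where the two differ.
    A missing hit in the shorter list is reported as None.
    """
    mismatches = []
    pairs = zip_longest(boldigger_hits, bold_hits)
    for rank, (ours, theirs) in enumerate(pairs, start=1):
        if ours != theirs:
            mismatches.append((rank, ours, theirs))
    return mismatches


if __name__ == "__main__":
    # Illustrative single-hit lists mirroring the mismatch described in this issue.
    from_boldigger = [("Chironomidae", 87.38)]
    from_bold = [("Ochrophyta", 87.38)]
    for rank, ours, theirs in compare_hits(from_boldigger, from_bold):
        print(f"rank {rank}: BOLDigger={ours} BOLD={theirs}")
```

An empty result means both lists agree at every rank.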