No hits below 94% similarity

vanessamata commented 3 weeks ago

Hi Dominik!

Thanks for quickly updating BOLDigger to support BOLD V5!

I've noticed that there now seem to be a lot of "no matches", and when checking the hits there are none below 94% similarity, despite having selected database 3 (animal library, public + private) and mode 3 (Exhaustive Search). This issue seems to happen on the website as well, so I guess it is not a problem with BOLDigger itself, but rather some kind of bug in their identification engine. I have no idea when this will be fixed (I have already emailed them), so I wonder if BOLDigger (perhaps v2?) could keep accessing BOLD V4 instead? I'm trying to identify things from Africa, so a lot of the hits are below 94%. I managed to get hits with 90-93% similarity on BOLD V4, but get "no match" on V5.

Thanks!!

Vanessa

ps: right now the identification engine of V4 seems to be down though... at least I haven't been able to get any result.

DominikBuchner commented 3 weeks ago

Hi Vanessa, the BOLD V4 API is also down, so without major updates to the code this is not easily doable. It might be possible to unlock different search parameters with some computer magic, I'll take a look. Until then our best bet is to contact BOLD regarding the issue. I believe it should be an easy fix for them, and I also believe mode 3 and database 3 should at least go down to 85%, as they state on the website.

Edit: Please let me know as soon as you get a response from them, this is really interesting.

vanessamata commented 3 weeks ago

I got a reply saying they would look into it and get back to me as soon as possible, and 20 minutes later they asked for an example of a no match and of something that would only report hits above 94%. I have provided example sequences this morning and now I am waiting for feedback :) fingers crossed that they solve the issue! :)

DominikBuchner commented 3 weeks ago

Perfect, thank you very much. When they fix it, BOLDigger3 will automatically adjust!

Anto007 commented 3 weeks ago

Hi @DominikBuchner

I've got a somewhat similar issue to the one reported here. Below are the results from BOLDigger3 for my COI ASVs, and below those are the results from BOLDigger2 (generated months ago) for the same COI ASVs. Obviously, the number of "No matches" is high in the results from BOLDigger3, presumably due to the 85% similarity cutoff in mode 3 in BOLD v5. In BOLDigger2, different thresholds (97%: species level, 95%: genus level, 90%: family level, 85%: order level, <85% and >=50%: class level) for the taxonomic levels were used to find the best fitting hit, but I guess going down to 50% identity to get class-level classifications is not going to be possible anymore when using the BOLD v5 database?
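For readers following along, the tiered cutoffs listed above can be sketched roughly like this. This is only an illustration, not BOLDigger's actual code: the function name, the `min_pct_id` parameter and the assumption that each cutoff is inclusive are all hypothetical.

```python
# Illustrative sketch only, not BOLDigger's implementation.
# Cutoffs are the BOLDigger2 values quoted in the comment above;
# treating them as inclusive (>=) is an assumption.
THRESHOLDS = [
    (97.0, "species"),
    (95.0, "genus"),
    (90.0, "family"),
    (85.0, "order"),
    (50.0, "class"),
]


def level_for_identity(pct_id: float, min_pct_id: float = 50.0) -> str:
    """Return the lowest taxonomic level a hit with this identity may support.

    min_pct_id is a hypothetical knob: hits below it count as "no match".
    """
    if pct_id < min_pct_id:
        return "no match"
    for cutoff, level in THRESHOLDS:
        if pct_id >= cutoff:
            return level
    return "no match"
```

Calling `level_for_identity(pct_id, min_pct_id=85.0)` would mimic dropping the class tier entirely, so anything under 85% ends up as "no match".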

DominikBuchner commented 3 weeks ago

Yes and I would not trust anything below 85% anyways, therefore I removed the lower threshold entirely. I think this is actually a good improvement by BOLD if only it was working.

Anto007 commented 3 weeks ago

Thanks for your response @DominikBuchner and I understand your point. However, a consequence of this will now be that a major chunk of our sequences from eDNA sequencing of poorly characterized environments is going to be reported as "Unclassified". I suppose the choice to go for a conservative approach or a liberal approach with respect to assigning taxonomy might be considered somewhat subjective and context-dependent.

DominikBuchner commented 3 weeks ago

I think if we actually get 85%+ soon this will be much less of an issue.

Anto007 commented 3 weeks ago

But my BOLDigger3 results above are from this morning and they have clearly taken hits > 85%. I didn't see a <94% problem at least in my results

DominikBuchner commented 3 weeks ago

Can you send the file with all results to my working mail address? I'd like to have a look, because I was able to reproduce the issue described!

Anto007 commented 3 weeks ago

Sorry, I'm unable to follow you. You mean to say you too got results files that did not have results <94%? My results sheet from this morning has got plenty of hits at around 85% and I had used boldigger3-1.1.2 (which I've upgraded further to 3-1.1.4 at this very minute but I'm yet to test this upgraded version)

DominikBuchner commented 3 weeks ago

So you get results between 85% and 94%? That would mean that they fixed it immediately.

vanessamata commented 3 weeks ago

Perhaps they fixed it already? I will try again to see if it is solved!

Anto007 commented 3 weeks ago

Yes, I didn't notice a <94% identity problem in my BOLDigger3 results from this morning's run (for example, see the 88.854% identity result for my ASV_9 in the BOLDigger3 screenshot I posted earlier in this thread). I haven't had a chance to run BOLDigger3 again today, so I don't know if something broke at the BOLD server end later on, for example in the afternoon.

DominikBuchner commented 3 weeks ago

Hi all, short update on that matter:

The bug is not fixed yet and I can reproduce it via the website. I cannot get any pct_id value between 0.85 and 0.94.
@Anto007: It would be great if you could share the fasta you posted above with me so I can investigate this further because I did not find a single case with over 10k sequences. It does not seem to be a bug in BOLDigger but the ID engine itself
Upon further investigation, it turns out that, when you know how to, you can freely choose the search parameters, e.g. reduce the pct_id to 70%, get 500 hits for each sequence, etc. I don't believe BOLD intends this so I won't implement it for now but contacted them to report the bug.

TLDR: No hits below 94% for now, I'm in contact with BOLD.

DominikBuchner commented 3 weeks ago

Update: BOLD is aware of the problem and actively working to fix it.

Anto007 commented 3 weeks ago

@DominikBuchner That's strange, perhaps I got lucky then. Here are my input ASVs fasta file and the identification results that I got from BOLDigger3. Test_ASVs_identification_result.xlsx Test_ASVs.fa.txt There are 7033 ASV sequences in total and you'll notice that many of them are indeed <90% identity hits. You seem to have closed this issue, but have the BOLD admins told you that they have fixed it? Just curious.

DominikBuchner commented 3 weeks ago

Not fixed yet, but it seems to be a minor issue. What was the operating mode and db for your fasta?

Anto007 commented 3 weeks ago

--db 3 and --mode 3. The total run time was an impressive 3 hours.

DominikBuchner commented 3 weeks ago

This is really strange. I can confirm that with your .fasta I also get results between 85% and 94%, but I was not able to reproduce this with any other file! Can you tell me the operating system this file was produced on?

Anto007 commented 3 weeks ago

@DominikBuchner It was generated on an Ubuntu 20.04 LTS OS after running dada2 and some final ASV filtering steps such as removal of NuMTs, non-Eukaryota ASVs and so on.

DominikBuchner commented 3 weeks ago

Hm okay, I officially don't get it. Let's see if I get a positive response from BOLD, so far, it does not seem to be fixed (except for Anto's file :D)

Anto007 commented 3 weeks ago

Very odd... What if you made a new hybrid test fasta file containing, for example, 10 of my sequences and 10 of yours?

DominikBuchner commented 3 weeks ago

Maybe a good idea. Will test tomorrow.

vanessamata commented 3 weeks ago

Interesting... I've tried re-running single sequences on the website and I still get the same issue: no matches or only matches >94%. It's very odd that for one specific fasta file it does provide results >85%... I am very confused! I haven't had any feedback from BOLD since I provided example sequences :(

DominikBuchner commented 2 weeks ago

So I got feedback from BOLD: They say the website is working as intended and there is no bug. I'm as confused as you are, but will perform further tests. They will publish an API around January/February, which will speed up the whole process once implemented into BOLDigger3. The other bug I reported was resolved today, so no more "unavailable" process IDs. I believe the remaining issue has something to do with the formatting of Anto's file, so I will do some bug-testing with the same data in different formats. Will keep you updated.

vanessamata commented 2 weeks ago

their website is working so well that I have been waiting for over 10 minutes for a single sequence with no luck... boldigger3 also seems to be stuck. Maybe their identification engine is down?

V4 seems to be working again fine though xD would it be difficult to update boldigger2 to use the new address of V4?

DominikBuchner commented 2 weeks ago

I'll check the options here. I believe the old bold API is down, but will check tomorrow! Really sorry about the mess this is causing, no advertisement for genetic methods tbh.

Regarding v5: just waiting helps a lot :D

DominikBuchner commented 2 weeks ago

It appears to be the length of the sequence. Sequences shorter than 225 bp produce this bug. I reported it and am waiting for a response.
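To check which of your own sequences fall under that length, something like the following could work. It is a minimal sketch using only the standard library: the function names are made up for illustration, and the 225 bp cutoff is simply the value reported in the comment above, not a documented BOLD parameter.

```python
# Minimal sketch: split FASTA records into "short" and "long" by length.
# MIN_LENGTH comes from the observation above; it is not an official BOLD value.
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 225  # bp

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(
    records: Iterable[Record], min_length: int = MIN_LENGTH
) -> Tuple[List[Record], List[Record]]:
    """Return (short, long): short records have fewer than min_length bases."""
    short, long_ = [], []
    for header, seq in records:
        (short if len(seq) < min_length else long_).append((header, seq))
    return short, long_
```

For a file on disk, `with open("my_asvs.fasta") as handle: short, long_ = split_by_length(read_fasta(handle))` would give the two groups (the file name is just a placeholder); the short group is the one to watch for missing hits below 94%.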

Anto007 commented 2 weeks ago

Ohhh...weird but great to know that you were finally able to identify the bug here

vanessamata commented 2 days ago

@DominikBuchner I got the following answer from BOLD today:

"We had our team look into this and they concluded that the reason it doesn't show up is that it doesn't meet the other parameters for a valid match. Specifically, the parameter overlap is too low (query seq length - gaps) for certain match configurations even though results may reach 90%-93% identification. We may re-examine these parameters in the future, but for now we suggest using v4 for your analysis in this case."

This is a big bummer for people working with degraded DNA (small sequences), like dietary analysis and environmental DNA. I know it's probably annoying, but can you update boldigger2 so it functions with v4?

DominikBuchner commented 2 days ago

Hi Vanessa, I'm not sure I got this correctly. If we find out how this overlap parameter works, I can probably tweak it so that lower matches are accepted. I agree that this sucks, but backporting the code to v4 a) makes no sense, since it will be taken down soon, and b) does not work, since they shut down the API that I heavily used in BOLDigger2.

So, to move forward: Let's find out how this overlap parameter works, then maybe we can tweak BOLDigger3. Will take a look into this!

Edit: I believe the new search engine uses blast in the background, so I will have to do some digging here since I'm not a bioinformatician by trade ;)

vanessamata commented 2 days ago

Well, if you can figure out how to bypass that parameter, that would be great! :D I assumed you couldn't really tweak their algorithm...! But if they are using blast in the background, maybe that's possible...? I wish I could be of more help instead of crying for help eheh, so thanks for your effort and your nonetheless very useful non-bioinformatician skills ;)

DominikBuchner commented 2 days ago

At least it is quite easy to change the max hits returned and also to go down to similarity values of 0.8. So I guess I can also reverse engineer how to change the minimum overlap.

vanessamata commented 2 days ago

🤞🤞🤞

DominikBuchner commented 2 days ago

Okay, the overlap parameters can be changed, however I did not find out how this affects the results (it does change them, but I don't have sufficient time to do robust testing atm). I have to finish off a few other urgent things first, then I'll come back to this!

DominikBuchner / BOLDigger3

No hits below 94% similarity #5