bcgsc / biobloom

Create Bloom filters for a given reference and then use it to categorize sequences
http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools
GNU General Public License v3.0

info long reads mapping - bbtools parameters and inquiring for suggestions #89

Open KristinaGagalova opened 1 month ago

KristinaGagalova commented 1 month ago

Hi, I am circling back to Biobloomtools, hoping to use it for contamination screening. Could you please provide more info on whether it can be used for long reads, and what FPR and k-mer size would be appropriate in that case? Do you have a protocol or set of parameters to recommend? Thanks, Kristina

JustinChu commented 3 weeks ago

I did some testing on long reads in 2020, with some results on a BCGSC Jira ticket. I think the results were pretty good, and since Nanopore accuracy has improved since then, I expect it to work even better now. For HiFi reads it should be easier still, since you could probably keep -k fairly large, or even benefit from increasing it to improve performance.

I think the main ingredient was that I used the binomial score, which is the default scoring math for miBFs. I was considering making it the default scoring method because of its robustness, but didn't want to break any workflows that expect the older scoring methods.

So basically I think I tried something like:

If you get good results, post them here and I'll try to add something to the readme about using long reads. If it turns out to be too slow with the parameters that work best for sensitivity and specificity, I think there is room for some easy optimizations, based on a quick look at the code.

Now that I think about it, I'll at least try to put something in the readme about binomial scoring when I have time.

lcoombe commented 3 weeks ago

Thanks so much for all that info, @JustinChu!

I found the JIRA ticket that I think you're referring to, and yes, looks like the parameters that you mentioned above are what you suggested back in 2020:

With the current master branch and the upcoming release (2.3.3) of BBT, suggested ranges for options for long reads:

-k: 18 - 25 (k 25 may be useful because that is what the standard pipeline uses, though a smaller k may be more sensitive)
-D (dust, low complexity filter)
-S: "binomial" (New scoring method)
-s: 60 - 100 (60 is 1/10^6 FPR, 100 is 1/10^10 FPR)
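
The two -s values quoted above (60 for 1/10^6, 100 for 1/10^10) are consistent with a Phred-like scale, score = -10 * log10(FPR). That relationship is inferred from those two data points only and is an assumption, not something confirmed from the BBT source. Under that assumption, a minimal Python sketch for converting between a target FPR and an -s value could look like this (the function names are hypothetical and not part of BBT):

```python
import math


def score_to_fpr(score: float) -> float:
    """Convert a -s threshold to a false positive rate.

    Assumes score = -10 * log10(FPR), inferred from the two
    values quoted in the thread (60 -> 1e-6, 100 -> 1e-10).
    """
    return 10 ** (-score / 10)


def fpr_to_score(fpr: float) -> float:
    """Convert a target false positive rate to a -s threshold (same assumption)."""
    if not 0 < fpr < 1:
        raise ValueError("fpr must be strictly between 0 and 1")
    return -10 * math.log10(fpr)


if __name__ == "__main__":
    # Print the implied FPR across the suggested -s range.
    for s in (60, 70, 80, 90, 100):
        print(f"-s {s}: FPR ~ {score_to_fpr(s):.0e}")
```

If the assumption holds, each step of 10 in -s corresponds to a tenfold stricter FPR, which may help when picking an intermediate value between 60 and 100.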

But as you say, the k-mer sizes you suggested were based on the technology back then