KristinaGagalova opened 1 month ago
I did some testing on long reads in 2020, with some results on a BCGSC Jira ticket. I think the results were pretty good back then, and with modern Nanopore accuracy having improved since, I expect it will be even better now. For HiFi reads it should be even easier, since you could probably keep `-k` pretty big, or even benefit from increasing it to improve performance.
I think the main ingredient was that I used binomial scoring, which is the default scoring math for miBFs. I was considering making it the default scoring method everywhere for its robustness, but I didn't want to break any workflows expecting the older scoring methods.
So basically I think I tried something like:

- `-S binomial` with `-s` from 40 to 100 (`-s` becomes the minimum `-10*log10(FPR)` threshold for a match). The parameter is essentially a cap on the FPR expressed on a Phred-like `-10*log10` scale, so the math works out similar to MAPQ values in aligners like BWA. For example, `-s 10` means you accept 10% of matches being false positives, and `-s 60` means you accept about 1 in 10^6 reads being false-positive hits.
- `-k 19` to compensate for the error rate. You might get away with something smaller, but with the length of the reads you have, you may not need to keep it that low. It might even be fine to begin testing with the default `-k 25` if you already have filters.
- `--dust`. I'm not an expert on DUST, but I integrated an off-the-shelf implementation of it to deal with low-complexity and often repetitive sequences. I'm not sure about its effect on performance, though, so another strategy is to repeat-mask your genome FASTA file before filter creation, which might achieve a similar thing without having to use DUST.

If you get good results, post them here and I'll try to add something to the README about using long reads. If you find it too slow with whatever parameters work best from a sensitivity & specificity perspective, I think there is room for some easy optimizations based on a quick look at the code.
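To make the `-s` scaling and the `-k` trade-off concrete, here's a small sketch of the Phred-style conversion described above, plus the standard back-of-the-envelope estimate of how many k-mers survive a given per-base error rate. The helper names are mine, not part of BBT, and the error model (independent per-base errors) is a simplifying assumption:

```python
import math

def s_to_fpr(s):
    """Convert a -s threshold to the FPR it allows: s = -10*log10(fpr)."""
    return 10 ** (-s / 10)

def fpr_to_s(fpr):
    """Inverse: the -s value needed to cap the FPR at `fpr`."""
    return -10 * math.log10(fpr)

# -s 10 -> 10% false positives; -s 60 -> 1 in 10^6
print(s_to_fpr(10))   # 0.1
print(s_to_fpr(60))   # 1e-06

def error_free_kmer_fraction(error_rate, k):
    """Expected fraction of k-mers containing no sequencing errors,
    assuming independent per-base errors."""
    return (1 - error_rate) ** k

# Why a smaller k helps noisy reads: at ~5% error, k=19 keeps more
# intact k-mers than k=25; at HiFi-like ~0.1% error, both are fine.
print(round(error_free_kmer_fraction(0.05, 19), 3))   # 0.377
print(round(error_free_kmer_fraction(0.05, 25), 3))   # 0.277
print(round(error_free_kmer_fraction(0.001, 25), 3))  # 0.975
```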
Now that I think about it, I'll also try to put something in the README about binomial scoring when I have time.
Thanks so much for all that info, @JustinChu!
I found the JIRA ticket that I think you're referring to, and yes, it looks like the parameters you mentioned above are what you suggested back in 2020:
With the current master branch and the upcoming release (2.3.3) of BBT suggested ranges for options for long reads:
-k: 18 - 25 (k 25 may be useful because that is what the standard pipeline uses, though a smaller k may be more sensitive)
-D (dust, low complexity filter)
-S: "binomial" (New scoring method)
-s: 60 - 100 (60 is 1/10^6 FPR, 100 is 1/10^10 FPR)
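For reference, a sketch of how these options might fit together in a run (the filter prefix, file names, and the split of options between `biobloommaker` and `biobloomcategorizer` are my assumptions, not a confirmed protocol; check `--help` on your build, since `-S`/`--dust` only exist in newer releases per the comment above):

```shell
# Build the filter from the (optionally repeat-masked) genome;
# -k is fixed at filter-creation time. Prefix/paths are hypothetical.
biobloommaker -p contam_screen -k 25 reference.fa

# Screen long reads with binomial scoring, a 1/10^6 FPR threshold,
# and the DUST low-complexity filter.
biobloomcategorizer -S binomial -s 60 --dust \
    -f contam_screen.bf long_reads.fq
```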
But as you say, the k-mer sizes you suggested were based on the technology back then.
Hi, I am circling back to BioBloomTools, hoping to use it for contamination screening. Could you please provide more info on whether it's possible to use it for long reads, and what the FPR and k-mer size should be in that case? Do you have a protocol or set of parameters to recommend? Thanks, Kristina