amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

Indel calling #154

Open kokyriakidis opened 2 years ago

kokyriakidis commented 2 years ago

Hi!

I want to accurately detect indels in some panel samples. I only care about small indels (<= 50 bases). Do you think I should increase the "-d max edit distance" option to 50? Does increasing it only affect SNAP's speed, or does it also affect accuracy?

How does "-d max edit distance" compare with "-i max edit distance to consider for potential indels"?

Are there any other options I should consider to increase sensitivity and precision around indels?

My first priority is sensitivity and accuracy and not speed.

KK

bolosky commented 2 years ago

If you want to find indels up to 50, then you should make -d a little bigger than 50 in case there are other differences in the read, like SNPs away from the indel. The max value for -d is 62 or 63 (depending on other stuff) so you have some slack here.
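To make the "a little bigger than 50" point concrete, here is a minimal sketch (not SNAP code) that uses a plain Levenshtein distance on a toy sequence. The reference string, the read and the `max_d` value are all made up for illustration. The idea it shows is that an indel costs roughly its length in edit distance, and every extra SNP in the read adds to that, so the budget given to -d has to cover both.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

# Hypothetical 30-base reference window and a read derived from it.
ref = "ACGTTGCAAGCTTAGGCTAACGGATCCATG"
read = ref[:10] + ref[15:]          # 5 reference bases missing from the read
read = read[:20] + "A" + read[21:]  # one SNP away from the indel (C -> A)

d = edit_distance(read, ref)
max_d = 6  # stand-in for the value passed to -d
print(f"edit distance = {d}, fits within -d {max_d}: {d <= max_d}")
```

With a budget equal to the indel length alone (5 here), a read carrying an additional SNP could be rejected, which is why some slack above the largest indel of interest is useful.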

Using -d this big will slow SNAP down quite a bit if you happen to have a lot of reads that don't align at all (or align with high edit distance) but that have enough similarity to the reference to have many seed hits. You may or may not care about this and you can experiment with your data to see what happens.

What -i does is look for potential indels in the seeding phase. That is, if it sees two seed hits that are close to one another but offset (which might indicate an indel in the read between the seeds), it increases the max edit distance only for that alignment candidate. It has a much smaller performance impact than -d, but it will miss an indel that doesn't have seed hits on both sides of it (because, for example, it's close to the end of the read, or because the region between the indel and one end of the read has enough differences from the reference that it doesn't contain an exact match corresponding to a seed SNAP looked at). In truth, if indels are close to the end of the read they're likely to be soft clipped anyway unless you turn off soft clipping.
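The seed-offset idea can be sketched roughly as follows. This is an illustration of the description above, not SNAP's actual implementation: the `(read_offset, ref_pos)` hit representation, the function names and the two-budget logic are all assumptions. Two seed hits from the same read that land on different "diagonals" (`ref_pos - read_offset`) hint at an indel between them whose size is the difference between the diagonals.

```python
from itertools import combinations

def candidate_indels(seed_hits, max_indel):
    """Return the sorted diagonal shifts implied by pairs of seed hits.

    seed_hits: list of (read_offset, ref_pos) tuples (hypothetical format).
    A positive shift means reference bases are missing from the read
    (deletion); a negative shift means extra bases in the read (insertion).
    """
    shifts = set()
    for (r1, g1), (r2, g2) in combinations(sorted(seed_hits), 2):
        shift = (g2 - r2) - (g1 - r1)  # difference between the two diagonals
        if shift != 0 and abs(shift) <= max_indel:
            shifts.add(shift)
    return sorted(shifts)

def edit_budget(seed_hits, base_d, indel_d):
    """Use the larger budget only when the seeds suggest an indel."""
    return indel_d if candidate_indels(seed_hits, indel_d) else base_d

# Two seeds on one diagonal, a third shifted by 30 reference bases.
hits = [(0, 1000), (20, 1020), (60, 1090)]
print(candidate_indels(hits, max_indel=60))       # [30]
print(edit_budget(hits, base_d=15, indel_d=60))   # 60
```

The sketch also shows the limitation mentioned above: if no seed matches on one side of the indel, every hit sits on the same diagonal and the larger budget is never triggered.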

kokyriakidis commented 2 years ago

Thank you for your detailed answer!

So, to summarize.

  1. I should increase '-d' to 60 for example.
  2. Should I turn off soft clipping for better results? If yes, how do I do that? (I have already trimmed adapters etc. from my data.)

Will these two changes give me the best SNAP performance for indels?

bolosky commented 2 years ago

I'd increase -d to 60 and see if you like the output. I'd also try -i 60, which will probably produce less noise since it will only increase the distance when it looks like there's an indel.

I think I spoke too soon about turning off soft clipping. We don't expose an option to do that, so you're stuck with it. As a result, you're not likely to find big indels that are near the ends of reads, since they'll get clipped. That said, indels near read ends are also pretty unreliable, so you probably just want to stick with ones in the middle anyway.
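One way to act on "stick with ones in the middle" is to check, after alignment, how far each indel sits from the nearest end of its read. The sketch below is an assumption-laden example rather than anything SNAP provides: it parses a standard SAM CIGAR string, and the function names and the `min_flank` threshold are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
READ_CONSUMING = "MIS=X"  # CIGAR operations that consume read bases

def indels_with_end_distance(cigar):
    """Return (op, length, distance_to_nearest_read_end) for each I/D."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    read_len = sum(n for n, op in ops if op in READ_CONSUMING)
    found = []
    pos = 0  # read bases consumed so far
    for n, op in ops:
        if op in "ID":
            end = pos + n if op == "I" else pos  # deletions use no read bases
            found.append((op, n, min(pos, read_len - end)))
        if op in READ_CONSUMING:
            pos += n
    return found

def central_indels(cigar, min_flank=20):
    """Keep only indels with at least min_flank read bases on each side."""
    return [x for x in indels_with_end_distance(cigar) if x[2] >= min_flank]

print(indels_with_end_distance("40M12D60M"))   # [('D', 12, 40)]
print(indels_with_end_distance("5S10M3I82M"))  # [('I', 3, 15)]
print(central_indels("5S10M3I82M"))            # []
```

Soft-clipped bases are counted as part of the read here, so an indel right next to a clipped region is still treated as close to the read end.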