fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing

Difference between SUPPORT, DV and RNAMES #496

Open kamilafras opened 3 months ago

kamilafras commented 3 months ago

Hello, I've been using the ngmlr v0.2.7 aligner along with Sniffles v2.2 on parasite genomes. I generate a variant file, but I am struggling to understand the difference between some of the fields. As an example, I provide a screenshot of a variant. The first entry shows the DV, the second entry shows the SUPPORT number, and the third shows the names of the reads that support the variant (RNAMES).

First, only one read ID is listed, but according to the SUPPORT number there should be nine. Is there supposed to be a one-to-one relationship between these two?

Second, what exactly is DV in relation to SUPPORT? They have similar definitions: SUPPORT is defined as "Number of reads supporting the structural variation" and DV is defined as "Number of variant reads". As I understand it, the variants detected by Sniffles2 are structural variants, so these two definitions seem to be the same.

Thank you :)

hermannromanek commented 3 months ago

Hi @kamilafras

You're correct: SUPPORT, DV, and the length of RNAMES should be the same, and this should be fixed (save for an edge case with long insertions) in the current version of Sniffles. Could you try running it again on the most recent version and tell us if there's still an issue?

Thanks, Hermann

janeshen91 commented 2 months ago

@hermannromanek I'm wondering if you can talk a bit about the edge case with long insertions? I've been seeing long insertions that have a lot of support (say 1466), but have only 1 RNAME printed. It would also say the support from clipped reads is 1465.

Ad_ED151 1559 Sniffles2.INS.7S0 N 60 PASS PRECISE;SVTYPE=INS;SVLEN=31431;END=1559;SUPPORT=1466;RNAMES=ec4bb941-1665-4f95-82dc-9b8de6994e89;COVERAGE=166318,165450,165013,164775,162988;STRAND=+;NM=0.025;AF=0.103;STDEV_LEN=0;STDEV_POS=0;SUPPORT_LONG=1465 GT:GQ:DR:DV ./.:60:147987:17026

I found this in the sniffles2 paper: Long insertions (that is, multiple kbp) are often difficult to detect even in long-read data because reads often do not span the full insertion sequence. To improve detection of long insertions, Sniffles2 records these clipped read events as additional support for presence of a large insertion. This enables Sniffles2 to accurately detect large insertions even when the SV is fully covered by just a single read.

Does this mean that I can assume we only had one read that spanned the entire insertion, and 1465 other reads that are clipped in that same region but whose sequence was not checked by Sniffles2?
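To make the numbers in that record concrete, here is a minimal Python sketch that pulls SUPPORT, SUPPORT_LONG, the RNAMES count, and DV out of the INFO and sample columns pasted above. The helper names (`parse_info`, `parse_sample`, `support_summary`) are made up for illustration and are not part of Sniffles. The idea that SUPPORT minus SUPPORT_LONG should equal the number of RNAMES is only my assumption about how the fields relate.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags such as PRECISE map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_sample(fmt, sample):
    """Pair FORMAT keys (e.g. GT:GQ:DR:DV) with the sample's values."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def support_summary(info, fmt, sample):
    """Collect the read-support numbers that this thread is comparing."""
    info_fields = parse_info(info)
    sample_fields = parse_sample(fmt, sample)
    rnames = [name for name in info_fields.get("RNAMES", "").split(",") if name]
    return {
        "support": int(info_fields["SUPPORT"]),
        "support_long": int(info_fields.get("SUPPORT_LONG", 0)),
        "n_rnames": len(rnames),
        "dv": int(sample_fields["DV"]),
    }


# INFO, FORMAT and sample columns copied from the record above.
info = (
    "PRECISE;SVTYPE=INS;SVLEN=31431;END=1559;SUPPORT=1466;"
    "RNAMES=ec4bb941-1665-4f95-82dc-9b8de6994e89;"
    "COVERAGE=166318,165450,165013,164775,162988;STRAND=+;NM=0.025;"
    "AF=0.103;STDEV_LEN=0;STDEV_POS=0;SUPPORT_LONG=1465"
)
summary = support_summary(info, "GT:GQ:DR:DV", "./.:60:147987:17026")
print(summary)
# Assumption being tested: reads not counted in SUPPORT_LONG are the named ones.
print(summary["support"] - summary["support_long"] == summary["n_rnames"])
```

For this record the difference SUPPORT - SUPPORT_LONG is 1, which matches the single read name, while DV (17026) agrees with neither number.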

Thanks, Jane

fritzsedlazeck commented 2 months ago

Yeah, so how it works is that reads are treated differently depending on whether they are split or not. If a read is not split, right now it is scanned for events within the alignment.

For the case of "long insertions": Sniffles looks at clipped reads (S or H operations in the CIGAR string). If the clipping is above the parameter threshold, the reads are considered support for a long insertion and are thus assigned to the split-read or alignment information.
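As a rough illustration of the clipping check described above (not Sniffles's actual code), the following Python sketch sums the soft- and hard-clipped bases at each end of a CIGAR string and compares the larger clip to a threshold. The function names and the `min_clip` argument are hypothetical. The real parameter name, its default, and whether the comparison is inclusive are not stated in this thread.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_lengths(cigar):
    """Return (left_clip, right_clip) in bases, counting S and H operations."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    aligned = [i for i, (_, op) in enumerate(ops) if op not in "SH"]
    if not aligned:
        # Degenerate case: the CIGAR holds nothing but clipping.
        return sum(length for length, _ in ops), 0
    first, last = aligned[0], aligned[-1]
    left = sum(length for length, _ in ops[:first])
    right = sum(length for length, _ in ops[last + 1:])
    return left, right


def supports_long_insertion(cigar, min_clip):
    """Hypothetical rule: a read counts if either clip reaches min_clip bases."""
    return max(clipped_lengths(cigar)) >= min_clip


print(clipped_lengths("30H100M10I50M2500S"))
print(supports_long_insertion("5000S200M", min_clip=1000))
print(supports_long_insertion("10S200M", min_clip=1000))
```

A check like this only says that a read is clipped near the breakpoint. It says nothing about how long the insertion is, which fits the point that such reads agree on a signal but not on a size.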

We recently had a bug in reporting support vs. read numbers/names for this scenario, since the clipped reads don't agree on the potential size of the INS but just agree that there is a signal. So you can see it is a bit complicated.

Hope that helps a little. Fritz