fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing
Other
545 stars 91 forks source link

<INS> sequence missing in ALT field and represented as C<INS>, T<INS>, A<INS>, G<INS> #501

Open leone93 opened 1 month ago

leone93 commented 1 month ago

As the title says. Running sniffles 2.4 with this setting on my sample:

sniffles -t "$thr" --minsupport 3 --max-del-seq-len 1000000 --reference "$REF.fasta" --input "$ALIGN_DIR/$i/$i.sorted.bam" \
--output-rnames --sample-id "$i" --vcf "$SV_DIR/samples/$i/$i.$REF_NAME.sniffles.vcf.gz" --snf "$SV_DIR/samples/$i/$i.$REF_NAME.sniffles.snf"
echo -e "$SV_DIR/samples/$i/$i.$REF_NAME.sniffles.snf\t$i" >> "$SV_DIR/samples_list_4_multisample.tsv"

produces files with this value in the ALT field of many INS. Moreover, this error is propagated to to the multisample files created with:

sniffles --threads "$thr" \
--reference "$REF.fasta" \
--input "$SV_DIR/samples_list_4_multisample.tsv" \
--vcf multisample.$REF_NAME.sniffles.vcf.gz

More details. It does not depend on SVLEN (it is random from that point of view). This error is not present on sniffles 2.2 (same settings), although I have some INS also there. Also, the quantity of INS in the multisample files is way lower in sniffles 2.2 compare to 2.4 on the same dataset (20 entries vs. 420). Is there a way to have the proper sequence on the alt field for INS?

hermannromanek commented 1 month ago

Hi @leone93

I'm looking into this issue - is it possible for you to share a couple of occurences from the generated VCF? This would help a lot in identifying whats going on.

As for the quantity of inserts, this is most likely a consequence of the newer version of sniffles also doing alignments, and possible separating calls that are sufficiently different. This is controlled by the merge param --combine-pctseq which defaults to 0.7, i.e. only calls whose sequence are above 70% similarity are reported as a single variant.

This is a new feature - did you check the variants to see if those get reported wrongly? Setting the mentioned parameter to 0 should emulate the 2.2 behaviour (i.e., no alignment checks being done on variants).

Thanks, Hermann

leone93 commented 3 weeks ago

Hi @hermannromanek and @fritzsedlazeck ! Sorry for the late response! here you are a couple of these INS lines from a single sample vcf: Chr1 41126819 Sniffles2.INS.4E8S0 T T<INS> 60 PASS IMPRECISE;SVTYPE=INS;SVLEN=6432;END=41126819;SUPPORT=34;RNAMES=m64313e_220704_023609/102760552/ccs,m64313e_220704_023609/96338895/ccs,m64313e_220704_023609/30867756/ccs,m64313e_220704_023609/151258310/ccs,m64313e_220704_023609/113640328/ccs,m64313e_220704_023609/179110627/ccs,m64313e_220704_023609/165546595/ccs,m64313e_220704_023609/27789126/ccs,m64313e_220704_023609/119799858/ccs,m64313e_220704_023609/121766126/ccs,m64313e_220704_023609/24578579/ccs,m64313e_220704_023609/61540535/ccs,m64313e_220704_023609/154601259/ccs,m64313e_220704_023609/67109779/ccs,m64313e_220704_023609/78250968/ccs,m64313e_220704_023609/23791966/ccs,m64313e_220704_023609/92146483/ccs,m64313e_220704_023609/25364332/ccs,m64313e_220704_023609/170985260/ccs,m64313e_220704_023609/160367877/ccs,m64313e_220704_023609/3541029/ccs,m64313e_220704_023609/6817328/ccs,m64313e_220704_023609/149949202/ccs,m64313e_220704_023609/42730503/ccs,m64313e_220704_023609/111085705/ccs,m64313e_220704_023609/56427088/ccs,m64313e_220704_023609/136971812/ccs,m64313e_220704_023609/104465954/ccs,m64313e_220704_023609/27134070/ccs,m64313e_220704_023609/68616675/ccs,m64313e_220704_023609/67240813/ccs,m64313e_220704_023609/27394082/ccs,m64313e_220704_023609/30213806/ccs,m64313e_220704_023609/3148415/ccs;COVERAGE=18,1,1,1,24;STRAND=+;AF=1.000;STDEV_LEN=38.532;STDEV_POS=0.000;SUPPORT_LONG=27 GT:GQ:DR:DV 1/1:60:0:85 Chr4 8607072 Sniffles2.INS.184S3 A A<INS> 45 PASS PRECISE;SVTYPE=INS;SVLEN=63;END=8607072;SUPPORT=5;RNAMES=m64313e_220704_023609/173344440/ccs,m64313e_220704_023609/99287661/ccs,m64313e_220704_023609/170591313/ccs,m64313e_220704_023609/14877803/ccs,m64313e_220704_023609/117442152/ccs;COVERAGE=22,21,21,21,16;STRAND=+-;AF=0.238;STDEV_LEN=0.000;STDEV_POS=0.000;SUPPORT_LONG=0 GT:GQ:DR:DV 0/1:5:16:5 Chr4 31659553 Sniffles2.INS.3E7S3 T T<INS> 55 GT PRECISE;SVTYPE=INS;SVLEN=163;END=31659553;SUPPORT=3;RNAMES=m64313e_220704_023609/132055311/ccs,m64313e_220704_023609/147130830/ccs,m64313e_220704_023609/45484245/ccs;COVERAGE=14,16,16,16,16;STRAND=+-;AF=0.188;STDEV_LEN=0.577;STDEV_POS=0.000;SUPPORT_LONG=0 GT:GQ:DR:DV 0/0:6:13:3 Chr5 16387275 Sniffles2.INS.20CS4 T T<INS> 60 PASS PRECISE;SVTYPE=INS;SVLEN=1276;END=16387275;SUPPORT=5;RNAMES=m64313e_220704_023609/114034422/ccs,m64313e_220704_023609/74908093/ccs,m64313e_220704_023609/134154239/ccs,m64313e_220704_023609/19267721/ccs,m64313e_220704_023609/174262772/ccs;COVERAGE=11,12,12,13,29;STRAND=+-;AF=0.417;STDEV_LEN=3.000;STDEV_POS=0.000;SUPPORT_LONG=0 GT:GQ:DR:DV 0/1:30:7:5 Chr6 3652857 Sniffles2.INS.5BS5 A A<INS> 60 PASS PRECISE;SVTYPE=INS;SVLEN=5839;END=3652857;SUPPORT=4;RNAMES=m64313e_220704_023609/112591763/ccs,m64313e_220704_023609/10094382/ccs,m64313e_220704_023609/5768041/ccs,m64313e_220704_023609/39780805/ccs;COVERAGE=20,1,1,1,11;STRAND=+;AF=1.000;STDEV_LEN=0;STDEV_POS=0;SUPPORT_LONG=3 GT:GQ:DR:DV 1/1:27:0:10 Chr6 11141004 Sniffles2.INS.180S5 T T<INS> 60 PASS PRECISE;SVTYPE=INS;SVLEN=7472;END=11141004;SUPPORT=22;RNAMES=m64313e_220704_023609/17826740/ccs,m64313e_220704_023609/76547106/ccs,m64313e_220704_023609/85984472/ccs,m64313e_220704_023609/11667759/ccs,m64313e_220704_023609/134283470/ccs,m64313e_220704_023609/170591116/ccs,m64313e_220704_023609/23922660/ccs,m64313e_220704_023609/148966122/ccs,m64313e_220704_023609/170001124/ccs,m64313e_220704_023609/75695761/ccs,m64313e_220704_023609/84740071/ccs,m64313e_220704_023609/66913999/ccs,m64313e_220704_023609/8717987/ccs,m64313e_220704_023609/46268658/ccs,m64313e_220704_023609/72417931/ccs,m64313e_220704_023609/55249368/ccs,m64313e_220704_023609/60360010/ccs,m64313e_220704_023609/115018033/ccs,m64313e_220704_023609/22151892/ccs,m64313e_220704_023609/136382408/ccs,m64313e_220704_023609/16843737/ccs,m64313e_220704_023609/123994259/ccs;COVERAGE=19,2,2,2,8;STRAND=+-;AF=1.000;STDEV_LEN=2.828;STDEV_POS=0.000;SUPPORT_LONG=18 GT:GQ:DR:DV 1/1:60:0:58 Chr7 22465817 Sniffles2.INS.26CS6 A A<INS> 49 GT PRECISE;SVTYPE=INS;SVLEN=266;END=22465817;SUPPORT=4;RNAMES=m64313e_220704_023609/13895232/ccs,m64313e_220704_023609/11012018/ccs,m64313e_220704_023609/88342596/ccs,m64313e_220704_023609/44894366/ccs;COVERAGE=51,50,50,50,45;STRAND=+-;AF=0.080;STDEV_LEN=0.000;STDEV_POS=0.000;SUPPORT_LONG=0 GT:GQ:DR:DV 0/0:60:46:4 Chr8 6662224 Sniffles2.INS.A9S7 A A<INS> 28 GT IMPRECISE;SVTYPE=INS;SVLEN=445;END=6662224;SUPPORT=7;RNAMES=m64313e_220704_023609/31916765/ccs,m64313e_220704_023609/179505146/ccs,m64313e_220704_023609/137628233/ccs,m64313e_220704_023609/24313897/ccs,m64313e_220704_023609/37685523/ccs,m64313e_220704_023609/55378596/ccs,m64313e_220704_023609/37880543/ccs;COVERAGE=50,50,50,57,58;STRAND=+-;AF=0.140;STDEV_LEN=40.696;STDEV_POS=49.641;SUPPORT_LONG=0 GT:GQ:DR:DV 0/0:49:43:7 Chr11 12181444 Sniffles2.INS.17FSA A A<INS> 60 PASS IMPRECISE;SVTYPE=INS;SVLEN=3640;END=12181444;SUPPORT=5;RNAMES=m64313e_220704_023609/79301286/ccs,m64313e_220704_023609/53478128/ccs,m64313e_220704_023609/79955941/ccs,m64313e_220704_023609/88016459/ccs,m64313e_220704_023609/139329790/ccs;COVERAGE=67,76,76,77,91;STRAND=+-;AF=0.145;STDEV_LEN=27.465;STDEV_POS=0.000;SUPPORT_LONG=2 GT:GQ:DR:DV 0/0:60:65:11 Chr11 27841081 Sniffles2.INS.480SA T T<INS> 60 PASS PRECISE;SVTYPE=INS;SVLEN=394;END=27841081;SUPPORT=13;RNAMES=m64313e_220704_023609/93586025/ccs,m64313e_220704_023609/4326981/ccs,m64313e_220704_023609/123404550/ccs,m64313e_220704_023609/131007122/ccs,m64313e_220704_023609/69796052/ccs,m64313e_220704_023609/53674667/ccs,m64313e_220704_023609/19597916/ccs,m64313e_220704_023609/91620639/ccs,m64313e_220704_023609/91031592/ccs,m64313e_220704_023609/111217044/ccs,m64313e_220704_023609/107806724/ccs,m64313e_220704_023609/71764623/ccs,m64313e_220704_023609/40501307/ccs;COVERAGE=30,30,30,30,34;STRAND=+-;AF=0.433;STDEV_LEN=0.000;STDEV_POS=0.000;SUPPORT_LONG=0 GT:GQ:DR:DV 0/1:60:17:13 Chr12 8902607 Sniffles2.INS.10ESB T T<INS> 60 PASS PRECISE;SVTYPE=INS;SVLEN=9019;END=8902607;SUPPORT=17;RNAMES=m64313e_220704_023609/90637006/ccs,m64313e_220704_023609/112984302/ccs,m64313e_220704_023609/160236139/ccs,m64313e_220704_023609/13239138/ccs,m64313e_220704_023609/105382991/ccs,m64313e_220704_023609/114426616/ccs,m64313e_220704_023609/90703249/ccs,m64313e_220704_023609/137823960/ccs,m64313e_220704_023609/13960686/ccs,m64313e_220704_023609/129827041/ccs,m64313e_220704_023609/78577958/ccs,m64313e_220704_023609/163513184/ccs,m64313e_220704_023609/43649010/ccs,m64313e_220704_023609/104203260/ccs,m64313e_220704_023609/29821465/ccs,m64313e_220704_023609/133564333/ccs,m64313e_220704_023609/32047747/ccs;COVERAGE=22,1,1,1,23;STRAND=+-;AF=1.000;STDEV_LEN=2.828;STDEV_POS=0.000;SUPPORT_LONG=15 GT:GQ:DR:DV 1/1:60:0:48 Chr12 9270541 Sniffles2.INS.113SB A A<INS> 60 PASS PRECISE;SVTYPE=INS;SVLEN=149;END=9270541;SUPPORT=6;RNAMES=m64313e_220704_023609/35523086/ccs,m64313e_220704_023609/37683345/ccs,m64313e_220704_023609/27918818/ccs,m64313e_220704_023609/56821920/ccs,m64313e_220704_023609/15206024/ccs,m64313e_220704_023609/127600358/ccs;COVERAGE=20,15,15,15,23;STRAND=+-;AF=0.400;STDEV_LEN=0.500;STDEV_POS=22.500;SUPPORT_LONG=0 GT:GQ:DR:DV 0/1:34:9:6 Chr12 11791587 Sniffles2.INS.18FSB A A<INS> 60 GT PRECISE;SVTYPE=INS;SVLEN=142;END=11791587;SUPPORT=3;RNAMES=m64313e_220704_023609/11929890/ccs,m64313e_220704_023609/3345120/ccs,m64313e_220704_023609/655939/ccs;COVERAGE=17,14,14,14,16;STRAND=+-;AF=0.214;STDEV_LEN=0.000;STDEV_POS=0.000;SUPPORT_LONG=0 GT:GQ:DR:DV 0/0:0:11:3 Chr12 14771624 Sniffles2.INS.274SB T T<INS> 60 PASS PRECISE;SVTYPE=INS;SVLEN=4084;END=14771624;SUPPORT=4;RNAMES=m64313e_220704_023609/27592328/ccs,m64313e_220704_023609/153356836/ccs,m64313e_220704_023609/173081666/ccs,m64313e_220704_023609/84084865/ccs;COVERAGE=10,10,10,11,20;STRAND=+-;AF=0.900;STDEV_LEN=0.000;STDEV_POS=0.000;SUPPORT_LONG=2 GT:GQ:DR:DV 1/1:15:1:9

For the multisample, I understand the --combine-pctseq paramter (it's the same of truvari), and maybe I will down it to 0.60 for my dataset. Another question about that: did you also compare the length for merging? I notice that even with a low limit of 50 bp for calling events, in the multisample files, some samples present a genotype, maybe in bigger variants, even though that sample genotype is below the minimum size. To make it more clear. In the multisample vcf, I have an INS of 60 bp that comes from a population of samples. Checking every sample in the aligned bam, I noticed 2 or 3 samples (out of 184) show that is true that there is an INS there, but that is below 50 bp (around 45-47) for these samples (and confirming that this event is not present in the single samples vcf of these samples). I think it is correct to say that there is an event there, but I'm not sure if it is grouped with the 60 bp event. Looking at in length, it is 30-35% more. How does sniffles merging logic work? Thanks for the clarification Leo