fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing

insertion REF alleles and deletion ALT alleles are always set to "N" #422

Open eblerjana opened 1 year ago

eblerjana commented 1 year ago

Hi,

I'm working with SV callsets produced by sniffles2 (v2.0.7). For deletions, the ALT sequence in the VCFs is always set to N, while the REF field contains the reference sequence. For insertions, it is the other way around: the REF allele is always N (which does not match the actual reference sequence at these positions). This leads to several problems when applying tools like bcftools to post-process these VCFs (e.g. bcftools norm --check-ref reports mismatches with the reference genome when REF is set to N).

To me, it looks like the N in the REF/ALT field represents an empty sequence. If this is the case, can I fix my sniffles2 VCFs by simply prepending the reference base before the variant to both the REF and ALT alleles, following the VCF specification (4.2):

For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field)

Or is there a script already to fix the VCFs to contain the actual reference sequence instead of N?

Thanks, Jana

defendant602 commented 10 months ago

Got the same problem: the REF allele of insertions and the ALT allele of deletions are always N (sniffles version 2.2).

yaningyang commented 9 months ago

sniffles contains an optional parameter, --reference:

--reference reference.fasta (Optional) Reference sequence the reads were aligned against. To enable output of deletion SV sequences, this parameter must be set. (default: None)

nextgenusfs commented 8 months ago

--reference does not fix the issue in v2.2; the reference alleles are still "N". This seems to be incorrect according to the VCF 4.2 spec. To fix it, it seems the position should be moved "left" by 1 bp, and the reference base at that position should be used as the REF allele for INS and the ALT allele for DEL. Here I'm trying to call from a de novo assembly against the reference.

$ sniffles --version
Sniffles2, Version 2.2

$ minimap2 -ax asm5 genome.fasta query-genome.fasta | samtools sort -o sniff.bam - 

$ sniffles -i sniff.bam -t 1 --no-qc --reference genome.fasta -v sniff.vcf

And then here is an example of an INS where ref allele is N. Note I've shortened the ALT allele sequence here for readability.

chr3    84234   Sniffles2.INS.1S1   N   ATAA...AATTC    60  PASS    PRECISE;SVTYPE=INS;SVLEN=4658;END=84234;SUPPORT=1;COVERAGE=1,1,1,1,1;STRAND=+;AF=1.000;STDEV_LEN=0;STDEV_POS=0;SUPPORT_LONG=0   GT:GQ:DR:DV 1/1:5:0:2

And then also an example of a DEL where ALT == 'N' (shortened the REF allele here for readability).

chr6    404818  Sniffles2.DEL.6S6       CCG...GCGA      N       60      SUPPORT_MIN     PRECISE;SVTYPE=DEL;SVLEN=-966;END=405784;SUPPORT=1;COVERAGE=1,1,1,1,1;STRAND=+;AF=1.000;STDEV_LEN=0;STDEV_POS=0      GT:GQ:DR:DV     1/1:2:0:1
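For anyone who wants to try the left-padding fix discussed above, here is a minimal sketch in Python. It is not a Sniffles or bcftools feature, and it rests on two assumptions that this thread has not confirmed: that the "N" stands for an empty allele, and that POS points at the first affected base, so the anchor base sits at POS - 1. The function name add_anchor_base and the fetch_base callback are hypothetical names made up for this example, and the reference sequence in the demo is toy data.

```python
def add_anchor_base(line, fetch_base):
    """Left-pad one tab-separated VCF data line whose REF or ALT is 'N'.

    fetch_base(chrom, pos) must return the reference base at the
    1-based position pos. Assumption: 'N' means an empty allele and
    the anchor base is located at POS - 1.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]

    # Leave records alone if neither allele is the placeholder,
    # or if there is no base to the left to anchor on.
    if "N" not in (ref, alt) or pos < 2:
        return "\t".join(fields)

    new_pos = pos - 1
    anchor = fetch_base(chrom, new_pos)

    fields[1] = str(new_pos)
    # Replace the placeholder with the anchor, prepend it to the other allele.
    fields[3] = anchor + ("" if ref == "N" else ref)
    fields[4] = anchor + ("" if alt == "N" else alt)
    # INFO fields such as END are left untouched in this sketch.
    return "\t".join(fields)


# Toy demo: a made-up 10 bp contig, not real reference data.
toy_reference = {"toy": "ACGTACGTAC"}


def toy_fetch(chrom, pos):
    return toy_reference[chrom][pos - 1]  # convert 1-based pos to 0-based index


ins = "toy\t5\tINS.1\tN\tTTT\t60\tPASS\tSVTYPE=INS"
dele = "toy\t5\tDEL.1\tACG\tN\t60\tPASS\tSVTYPE=DEL"
print(add_anchor_base(ins, toy_fetch))
print(add_anchor_base(dele, toy_fetch))
```

With a real genome, fetch_base could wrap an indexed FASTA reader, for example pysam's FastaFile.fetch(chrom, pos - 1, pos), which uses 0-based half-open coordinates. It is probably worth running bcftools norm --check-ref on the result afterwards, since a remaining mismatch would suggest the POS - 1 assumption does not hold for these records.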

ethering commented 7 months ago

I wonder if this is the same issue that I raised for Survivor, which perhaps would be better served here: https://github.com/fritzsedlazeck/SURVIVOR/issues/202