Variant type definitions

thkitapci commented 5 years ago

Hi, in the VCF file resulting from my paired-sample analysis I see these types of variants:

TYPE=Complex TYPE=DEL TYPE=Deletion TYPE=DUP TYPE=Insertion TYPE=INV TYPE=SNV

I don't understand what "DEL" means. At first I thought it was a deletion, but then I saw the other type called "Deletion". Can you please explain the meaning of DEL?

Thanks Best Regards Hamdi

PolinaBevad commented 5 years ago

Hamdi, hello. The "DEL" type denotes structural deletions. We have a separate type for them to make it easier to distinguish them from ordinary deletions (which have the "Deletion" type).

thkitapci commented 5 years ago

Hi @PolinaBevad, thanks for the quick reply. I am still confused about the difference between a "structural deletion" and an "ordinary deletion". Can you please provide a simple example to explain these two events?

Thanks! Hamdi

PolinaBevad commented 5 years ago

Hamdi, we classify a deletion as structural if its length is more than 1000 bp. This threshold can be changed with the -L option (because there is no rigid limit for SVs). We use additional algorithms to detect structural variants, separate from those used for SNPs and MNPs, because SVs are large-scale differences in the genome and can often be associated with diseases (they can be found in databases like dbVar or DGV). Put simply, it is a very long deletion that can have a great influence on the organism.
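The length rule described above can be sketched in a few lines of Python. This is only an illustration of the labelling threshold, not VarDict's detection logic: the function name `deletion_type` and the constant are hypothetical, and treating exactly 1000 bp as an ordinary deletion is an assumption based on the wording "more than 1000 bp".

```python
# Hypothetical helper for illustration; this is not part of VarDict.
# 1000 bp is the default threshold stated in the thread (adjustable with -L).
DEFAULT_SV_MIN_LENGTH = 1000


def deletion_type(length_bp, sv_min_length=DEFAULT_SV_MIN_LENGTH):
    """Return the TYPE label a deletion of this length would be expected to carry."""
    if length_bp <= 0:
        raise ValueError("deletion length must be positive")
    # Assumption: the boundary is strict ("more than" the threshold).
    return "DEL" if length_bp > sv_min_length else "Deletion"


for length in (12, 1000, 3640):
    print(length, deletion_type(length))
```

Passing a different `sv_min_length` mimics changing the threshold, for example `deletion_type(500, sv_min_length=200)`.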

PolinaBevad commented 5 years ago

Hamdi, hello! I will close this issue to keep the repository up to date, but please reopen it if you have any other questions on this topic. Thank you for reporting this issue and for helping us improve VarDict!

thkitapci commented 5 years ago

Thanks for the clarification!

thkitapci commented 5 years ago

Hi @PolinaBevad, I have found a "DEL" and a "DUP" in my analysis. I want to figure out the exact size of this deletion and duplication. Can you please help me extract this information from the resulting VCF file? (These are structural deletions and duplications, so they are at least 1000 bp with default parameters; is that correct?)

In addition to getting the exact size of these events, I want to visualize them using something like IGV. Do you have any suggestion on the best method to visualize DEL and DUP events?

Thanks! Best Regards Hamdi

PolinaBevad commented 5 years ago

Hamdi, hello!

What type of analysis do you use? For single analysis, the VCF result has all the information about SVs (the length of an SV can be found in the SVLEN field of the INFO column). For paired analysis, the functionality and information are limited, and the length is not included. If you ran a paired analysis and you have the intermediate result file from it (i.e. the output before the R script), the length of the SV will be in the GENOTYPE columns of that file. So if you open the VCF file from a single analysis in IGV, the SV will be shown as a long variant and all parameters will be shown. For a paired analysis, IGV will only show the start position of the variant, because the VCF file doesn't have the length information. I'm not sure if we will extend SV functionality in paired analysis in the near future, but I will try to discuss it.

The length of an SV can be less than 1000 bp for DUP and INV variants because they are complex changes, but for DEL it will always be more than 1000 bp.
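For the single-analysis case, reading SVLEN out of the INFO column can be sketched as below. This is a minimal sketch in plain Python, assuming a standard tab-separated VCF data line with INFO in the eighth column; the helper names `parse_info` and `sv_length` are hypothetical, and the example record is made up for illustration.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys (no '=') map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def sv_length(vcf_line):
    """Return (SVTYPE, SVLEN) for one VCF data line, or None if it has no SV tags."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the 8th VCF column
    if "SVTYPE" not in info or "SVLEN" not in info:
        return None
    return info["SVTYPE"], int(info["SVLEN"])


# Made-up record, for illustration only.
example = "\t".join(
    ["chr1", "100000", ".", "A", "<DEL>", "200", "PASS",
     "TYPE=DEL;DP=300;SVTYPE=DEL;SVLEN=2500"]
)
print(sv_length(example))  # ('DEL', 2500)
```

Records without SV tags, such as ordinary SNVs, return `None`, so the function can be applied to every data line of a file.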

thkitapci commented 5 years ago

Hi @PolinaBevad, I have 10 samples. After running VarDict on all of them, I merged the resulting VCF files using bcftools:

bcftools merge --merge all sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz >merged_with_bcftools.vcf

I realized that a lot of per-sample information, such as the length of SVs, is lost when I do this merge. (I can share my individual and merged VCF files if this will help clarify the issue.) When I went back and looked at the individual VCF files for each sample, I found the following field:

SVTYPE=DEL;SVLEN=3640

which tells me that I have a deletion of size 3640.

I have two questions

1) How can I extract the start and end coordinates of this DEL relative to the reference FASTA file?

2) Is there a way to merge individual VCF files without losing information regarding individual samples?

Thanks!

PolinaBevad commented 5 years ago

Hamdi, hello,

I do not fully understand what you mean by extracting coordinates relative to the reference FASTA file, but if you want the start and end coordinates of a DEL, the VCF contains the start of DELs and other SVs in the POS column and the end in the END field of the INFO section.

The utilities that I have used (vcftools, bcftools) combine INFO fields in some way, so I think some information will be lost anyway. I'm not sure whether the same happens with GATK MergeVCF; maybe you can try it?
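One way to sidestep how merge tools combine INFO fields is to skip the merge for SVs and collect them into a flat table, one row per sample per record. The sketch below is not a replacement for a real VCF merge and is not something VarDict provides; it assumes single-sample VCFs that carry SVTYPE and SVLEN in INFO, and the function names, sample names, and records are all made up for illustration.

```python
import csv
import io


def sv_rows(sample_name, vcf_lines):
    """Yield one dict per SV record (INFO contains SVTYPE) in a single-sample VCF."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        info = dict(item.split("=", 1) for item in cols[7].split(";") if "=" in item)
        if "SVTYPE" in info:
            yield {
                "sample": sample_name,
                "chrom": cols[0],
                "pos": int(cols[1]),
                "svtype": info["SVTYPE"],
                "svlen": int(info["SVLEN"]) if "SVLEN" in info else None,
            }


def combine(samples):
    """samples maps a sample name to an iterable of VCF lines; returns all SV rows."""
    rows = []
    for name, lines in samples.items():
        rows.extend(sv_rows(name, lines))
    return rows


# Made-up in-memory "files", for illustration only.
samples = {
    "sample1": ["##fileformat=VCFv4.1\n",
                "10\t150000\t.\tG\t<DEL>\t235\tPASS\tSVTYPE=DEL;SVLEN=3640\n"],
    "sample2": ["10\t500\t.\tA\tT\t90\tPASS\tTYPE=SNV;DP=40\n",
                "10\t20000\t.\tC\t<DUP>\t120\tPASS\tSVTYPE=DUP;SVLEN=800\n"],
}
rows = combine(samples)

# Write the combined table as tab-separated text.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["sample", "chrom", "pos", "svtype", "svlen"], delimiter="\t"
)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With real files, each iterable of lines could come from `gzip.open(path, "rt")` for a `.vcf.gz` file or from a plain `open(path)`.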

thkitapci commented 5 years ago

Hi @PolinaBevad, the END field is missing for the DEL I am looking at.

Here is the header of the VCF file for one of my samples

##fileformat=VCFv4.1
##INFO=<ID=SAMPLE,Number=1,Type=String,Description="Sample name (with whitespace translated to underscores)">
##INFO=<ID=TYPE,Number=1,Type=String,Description="Variant Type: SNV Insertion Deletion Complex">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=END,Number=1,Type=Integer,Description="Chr End Position">
##INFO=<ID=VD,Number=1,Type=Integer,Description="Variant Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=BIAS,Number=1,Type=String,Description="Strand Bias Info">
##INFO=<ID=REFBIAS,Number=1,Type=String,Description="Reference depth by strand">
##INFO=<ID=VARBIAS,Number=1,Type=String,Description="Variant depth by strand">
##INFO=<ID=PMEAN,Number=1,Type=Float,Description="Mean position in reads">
##INFO=<ID=PSTD,Number=1,Type=Float,Description="Position STD in reads">
##INFO=<ID=QUAL,Number=1,Type=Float,Description="Mean quality score in reads">
##INFO=<ID=QSTD,Number=1,Type=Float,Description="Quality score STD in reads">
##INFO=<ID=SBF,Number=1,Type=Float,Description="Strand Bias Fisher p-value">
##INFO=<ID=ODDRATIO,Number=1,Type=Float,Description="Strand Bias Odds ratio">
##INFO=<ID=MQ,Number=1,Type=Float,Description="Mean Mapping Quality">
##INFO=<ID=SN,Number=1,Type=Float,Description="Signal to noise">
##INFO=<ID=HIAF,Number=1,Type=Float,Description="Allele frequency using only high quality bases">
##INFO=<ID=ADJAF,Number=1,Type=Float,Description="Adjusted AF for indels due to local realignment">
##INFO=<ID=SHIFT3,Number=1,Type=Integer,Description="No. of bases to be shifted to 3 prime for deletions due to alternative alignment">
##INFO=<ID=MSI,Number=1,Type=Float,Description="MicroSatellite. > 1 indicates MSI">
##INFO=<ID=MSILEN,Number=1,Type=Float,Description="MicroSatellite unit length in bp">
##INFO=<ID=NM,Number=1,Type=Float,Description="Mean mismatches in reads">
##INFO=<ID=LSEQ,Number=1,Type=String,Description="5' flanking seq">
##INFO=<ID=RSEQ,Number=1,Type=String,Description="3' flanking seq">
##INFO=<ID=GDAMP,Number=1,Type=Integer,Description="No. of amplicons supporting variant">
##INFO=<ID=TLAMP,Number=1,Type=Integer,Description="Total of amplicons covering variant">
##INFO=<ID=NCAMP,Number=1,Type=Integer,Description="No. of amplicons don't work">
##INFO=<ID=AMPFLAG,Number=1,Type=Integer,Description="Top variant in amplicons don't match">
##INFO=<ID=HICNT,Number=1,Type=Integer,Description="High quality variant reads">
##INFO=<ID=HICOV,Number=1,Type=Integer,Description="High quality total reads">
##INFO=<ID=SPLITREAD,Number=1,Type=Integer,Description="No. of split reads supporting SV">
##INFO=<ID=SPANPAIR,Number=1,Type=Integer,Description="No. of pairs supporting SV">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SV type: INV DUP DEL INS FUS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="The length of SV in bp">
##INFO=<ID=DUPRATE,Number=1,Type=Float,Description="Duplication rate in fraction">
##FILTER=<ID=q22.5,Description="Mean Base Quality Below 22.5">
##FILTER=<ID=Q10,Description="Mean Mapping Quality Below 10">
##FILTER=<ID=p8,Description="Mean Position in Reads Less than 8">
##FILTER=<ID=SN1.5,Description="Signal to Noise Less than 1.5">
##FILTER=<ID=Bias,Description="Strand Bias">
##FILTER=<ID=pSTD,Description="Position in Reads has STD of 0">
##FILTER=<ID=d3,Description="Total Depth < 3">
##FILTER=<ID=v2,Description="Var Depth < 2">
##FILTER=<ID=f0.01,Description="Allele frequency < 0.01">
##FILTER=<ID=MSI12,Description="Variant in MSI region with 12 non-monomer MSI or 13 monomer MSI">
##FILTER=<ID=NM5.25,Description="Mean mismatches in reads >= 5.25, thus likely false positive">
##FILTER=<ID=InGap,Description="The variant is in the deletion gap, thus likely false positive">
##FILTER=<ID=InIns,Description="The variant is adjacent to an insertion variant">
##FILTER=<ID=Cluster0bp,Description="Two variants are within 0 bp">
##FILTER=<ID=LongMSI,Description="The somatic variant is flanked by long A/T (>=14)">
##FILTER=<ID=AMPBIAS,Description="Indicate the variant has amplicon bias.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=VD,Number=1,Type=Integer,Description="Variant Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=RD,Number=2,Type=Integer,Description="Reference forward, reverse reads">
##FORMAT=<ID=ALD,Number=2,Type=Integer,Description="Variant forward, reverse reads">

Here is the row containing the DEL

10 11291573 . G <DEL> 235 PASS SAMPLE=T7-69_chr10;TYPE=DEL;DP=1551;VD=83;AF=1.8864;BIAS=2:2;REFBIAS=26:15;VARBIAS=15:67;PMEAN=43.4;PSTD=1;QUAL=36.9;QSTD=1;SBF=0;ODDRATIO=7.58259;MQ=255;SN=166;HIAF=0.6640;ADJAF=0.25;SHIFT3=0;MSI=0;MSILEN=0;NM=0.2;HICNT=83;HICOV=125;LSEQ=CATGGCAGATAAAAGAGAAG;RSEQ=TAGGTATGGCGATTCAGTAC;DUPRATE=0;SVTYPE=DEL;SVLEN=3640;SPLITREAD=2;SPANPAIR=825 GT:DP:VD:AD:AF:RD:ALD 1/1:1551:83:41,83:1.8864:26,15:15,67

In the INFO field there is no "END" field.

Am I looking in the wrong place?

I have the POS, which is the "START", and the SVLEN. Can I compute POS+SVLEN to get the "END" position on the chromosome?

Here is a link to the VCF file

https://drive.google.com/file/d/16uF5rfEy4vlI2CkVBjWJjYOQqsP_w-p1/view?usp=sharing

Thanks Best Regards Hamdi

PolinaBevad commented 5 years ago

Hamdi, hello!

I see that the tag is missing in the file. For a single analysis, the "END" tag is printed by default after the DP value; it is not printed if you use the -E option with the var2vcf_valid.pl script. Is this the case? In any case, yes, calculating END as POS+SVLEN is correct.
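The fallback confirmed above (use END when it is present, otherwise POS+SVLEN) can be sketched as follows. The function name `sv_interval` is hypothetical, and the INFO string is an abbreviated copy of the DEL record quoted earlier in the thread. Taking the absolute value of SVLEN is an added assumption to cover callers that report deletions with a negative SVLEN; the VarDict record here has a positive value.

```python
def sv_interval(pos, info):
    """Return (start, end) for an SV record.

    Uses the END tag when it is present; otherwise falls back to POS + SVLEN.
    """
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if "END" in fields:
        return pos, int(fields["END"])
    # Assumption: abs() guards against a negative SVLEN written by other callers.
    return pos, pos + abs(int(fields["SVLEN"]))


# Abbreviated INFO column of the DEL record quoted in the thread (no END tag).
info = "TYPE=DEL;DP=1551;VD=83;SVTYPE=DEL;SVLEN=3640;SPLITREAD=2;SPANPAIR=825"
print(sv_interval(11291573, info))  # (11291573, 11295213)
```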

AstraZeneca-NGS / VarDictJava

Variant type definitions #224