ccagc / QDNAseq

QDNAseq package for Bioconductor
48 stars 27 forks source link

Output VCF is not valid due to symbolic allele used as reference allele #121

Open vlshesketh opened 10 months ago

vlshesketh commented 10 months ago

The VCFs generated by QDNAseq do not conform to the specification for the reference allele field, as they contain symbolic alleles e.g. <DIP>. This results in errors when trying to either validate the VCF using VCF validator:

Error: Reference is not a string of bases. This occurs 3 time(s), first time in line 14.

Or when loading the VCF into IGV:

The provided VCF file is malformed at approximately line number 14: Symbolic alleles not allowed as reference allele

The specification requires that the reference should a string of one or more bases (A,C,G,T or N, from https://samtools.github.io/hts-specs/VCFv4.4.pdf). Page 31 of the guidelines gives some examples of how to represent copy number events, or alternatively other structural variant/CNV callers use N as the reference base.

HenrikBengtsson commented 10 months ago

Thanks for this, and thanks for digging out the relevant references.

Can please attach an example of a VCF file so we, or someone else, can reproduce it with vcf_validator < /path/to/file.vcf (assuming that's the call). Also, what vcf_validator version do you use?

I'm not questioning your conclusions; I'm asking so we can create a minimal reproducible example, that we then can add as a unit test, which makes it easier to fix the bug.

vlshesketh commented 10 months ago

Thanks for the speedy reply! I've attached a VCF which fails validation (renamed extension to allow upload), as well as the output from vcf-validator. I'm using the latest version of vcf-validator which is 0.9.4.

SAMPLE.wf_cnv.vcf.txt SAMPLE.wf_cnv.vcf.errors_summary.1705397450367.txt

vlshesketh commented 8 months ago

Hello, is it possible to get an update on when the VCF will be modified so that it validates successfully?

sachingadakh commented 4 months ago

I run into same issue while loading VCF on IGV, "The provided VCF file is malformed at approximately line number 14: Symbolic alleles not allowed as reference allele" Is there any update on resolving this issue? However I replaced DIP with N resolved the issue

vmkalbskopf commented 1 month ago

I'm running into this error on IGV too.

HenrikBengtsson commented 3 weeks ago

Recap summary of this problem

The example VCF file that @vlshesketh provided in https://github.com/ccagc/QDNAseq/issues/121#issuecomment-1893375651 is:

##fileformat=VCFv4.2
##source=QDNAseq-1.34.0
##REF=<ID=DIP,Description="CNV call">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##FILTER=<ID=LOWQ,Description="Filtered due to call in low quality region">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant: DEL,DUP,INS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of variant">
##INFO=<ID=BINS,Number=1,Type=Integer,Description="Number of bins in call">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score of calling algorithm">
##INFO=<ID=LOG2CNT,Number=1,Type=Float,Description="Log 2 count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  test
21  14000001    .   <DIP>   <DUP>   1000    PASS    SVTYPE=DUP;END=46709983;SVLEN=32709983;BINS=60;SCORE=1;LOG2CNT=0.57 GT  0/1
X   1   .   <DIP>   <DEL>   1000    PASS    SVTYPE=DEL;END=156000000;SVLEN=156000000;BINS=274;SCORE=-1;LOG2CNT=-0.9 GT  0/1
Y   2500001 .   <DIP>   <DEL>   1000    PASS    SVTYPE=DEL;END=27000000;SVLEN=24500000;BINS=26;SCORE=-1;LOG2CNT=-1.04   GT  0/1

Using the lastest vcf-validator 0.10.0;

$ vcf_validator < SAMPLE.wf_cnv.vcf
[info] Reading from standard input...
[info] Summary report written to : stdin.errors_summary.1729547373906.txt
[info] According to the VCF specification, the input file is not valid

gives

$ cat stdin.errors_summary.1729547373906.txt
According to the VCF specification, the input file is not valid
Error: Reference is not a string of bases. This occurs 3 time(s), first time in line 14.
Warning: A valid 'reference' entry is not listed in the meta section. This occurs 1 time(s), first time in line 14.

The problematic line is:

$ sed -n '14p' SAMPLE.wf_cnv.vcf
21  14000001    .   <DIP>   <DUP>   1000    PASS    SVTYPE=DUP;END=46709983;SVLEN=32709983;BINS=60;SCORE=1;LOG2CNT=0.57 GT  0/1

Section 5.6 of the VCF v4.4 specification says:


5.6 Representing copy number variation

To encode copy number variation, VCF uses \<CNV>, \<DEL> and \<DUP> symbolic structural variant alleles, CN INFO and FORMAT fields.

Allele specific copy number is specified through a \<CNV> ALT allele for each distinct allelic copy number. INFO CN defines the allele specific copy number with FORMAT CN defining the overall copy number for that sample. POS and INFO SVLEN specify the genomic interval over which the copy number is defined. \<DEL> and \<DUP> copy number (SVCLAIM=D) alleles should be treated as \<CNV> alleles that implicitly define INFO CN=0 and CN=2, respectively. As with all symbolic structural variants, the starting position of the interval is the base immediately after POS. For example, a region on chr1 from position 101 to 130 (both inclusive) with allele-specific copy numbers of 1 and 2 can be represented as follows:

chr1 100 . T <CNV>,<CNV> . . END=130;SVLEN=30,30;CN=1,2 GT:CN 1/2:3

All \<CNV> alleles in the same VCF record should have the same SVLEN. To eliminate genotype ambiguity, copy number ALT alleles should not be mixed with other ALT alleles. When only copy number ALT alleles are present in a VCF record, GT=0 is equivalent to a \<CNV> ALT allele with INFO CN of 1 and should be treated identically.

If only total copy number is known, the copy number of the segment should be defined with a single \<CNV> ALT allele with a missing INFO CN field. In the above example this corresponds to the following:

chr1 100 . T <CNV> . . END=130;SVLEN=30 GT:CN .:3

The granularity of copy number representation is explicitly not defined in these specifications. Copy number segmentation can be base-pair accurate with even 1bp changes deletions resulting in new copy number segments, be at a highly granular megabase level of resolution, or anywhere in between. When the bounds of a copy number segment is not known precisely, this should be encoded in the CIPOS and CILEN INFO fields.


Help needed

I have very little time to work on this; can some please show and explain what output QDNAseq should give instead of the current:

chr1 100 . T <CNV>,<CNV> . . END=130;SVLEN=30,30;CN=1,2 GT:CN 1/2:3

to meet the VCF specifications?