Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

`writeVcf` doesn't capture ALT Descriptions properly #52

Closed LTLA closed 2 years ago

LTLA commented 2 years ago
fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
tmp <- tempfile()
writeVcf(vcf, filename=tmp)
lines[grepl("ALT=", lines)]
##  [1] "##ALT=<ID=DEL,Description=Deletion>"                       
##  [2] "##ALT=<ID=DEL:ME:ALU,Description=Deletion of ALU element>" 
##  [3] "##ALT=<ID=DEL:ME:L1,Description=Deletion of L1 element>"   
##  [4] "##ALT=<ID=DUP,Description=Duplication>"                    
##  [5] "##ALT=<ID=DUP:TANDEM,Description=Tandem Duplication>"      
##  [6] "##ALT=<ID=INS,Description=Insertion of novel sequence>"    
##  [7] "##ALT=<ID=INS:ME:ALU,Description=Insertion of ALU element>"
##  [8] "##ALT=<ID=INS:ME:L1,Description=Insertion of L1 element>"  
##  [9] "##ALT=<ID=INV,Description=Inversion>"                      
## [10] "##ALT=<ID=CNV,Description=Copy number variable region>"    

Note the lack of quotes in the Description value. This means that a subsequent readVcf does not recognize this field as a string and discards the ALT information altogether. The cause is here:

https://github.com/Bioconductor/VariantAnnotation/blob/54d2de6c7a1aa76f404076db79077cd05114edf2/R/methods-writeVcf.R#L181-L184

where there is a lack of special-casing for the Description column, which is present in the general case.

vjcitn commented 2 years ago

Restating the code chunk as there is a line missing above:

fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
tmp <- tempfile()
writeVcf(vcf, filename=tmp)
lines = readLines(tmp) # missing from chunk above
lines[grepl("ALT=", lines)]

Output for ALT is clearly problematic:

##  [1] "##ALT=<ID=DEL,Description=Deletion>"                       
##  [2] "##ALT=<ID=DEL:ME:ALU,Description=Deletion of ALU element>" 
...

I see how to fix the code in writeVcf to quote the description text. But I am not sure doing this affects the output in the desired way. Can you say a little more about the defect of the readVcf result? I am seeing ALT information in the readVcf even in the bugged version. What is meant by "discards the ALT information altogether"?

LTLA commented 2 years ago

When you try to call readVcf on the incorrect headers, it will not be able to parse the string properly.

# Assuming tmp was generated as above.
fixed(header(readVcf(tmp)))$ALT
## DataFrame with 10 rows and 0 columns

In contrast:

fixed(header(vcf))$ALT
## DataFrame with 10 rows and 1 column
##                       Description
##                       <character>
## DEL                      Deletion
## DEL:ME:ALU Deletion of ALU elem..
## DEL:ME:L1  Deletion of L1 element
## DUP                   Duplication
## DUP:TANDEM     Tandem Duplication
## INS        Insertion of novel s..
## INS:ME:ALU Insertion of ALU ele..
## INS:ME:L1  Insertion of L1 elem..
## INV                     Inversion
## CNV        Copy number variable..
vjcitn commented 2 years ago

Should be fixed in devel branch on git, which has version 1.41.3

hpages commented 2 years ago

Fixed confirmed