Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
26 stars 20 forks source link

Problems in parsing VCF fields containing multiple entries separated by commas (example with Mutect and Freebayes) #27

Closed annaquaglieri16 closed 5 years ago

annaquaglieri16 commented 5 years ago

Hi there,

I really love this package and I wrote one function to parse several type of VCF output (from several caller) into a standardised format with standardised columns (https://annaquaglieri16.github.io/varikondo/articles/vignette.html)

However, there is a problem when parsing VCF fields that contain more than one entry separated by a comma. For example, the QSS field in MuTect2 contains base_quality_ref,base_quality_alt'.VariantAnnotation` only reads in the first column. Below is a reproducible example.

library(varikondo)

mutect_vcf <- system.file("extdata", "germline_mutect.vcf", package = "varikondo")
vcf <- VariantAnnotation::readVcf(mutect_vcf)

head(VariantAnnotation::geno(vcf)$QSS)
                   SRX381851
chr1:36933772_C/T  15571    
chr1:36935370_T/C  13932    
chr1:36937059_A/G  7351     
chr1:36939108_T/C  10997    
chr1:36941395_G/C  6697     
chr1:36941539_G/GT 5449     
vcf_read_table <- read.table(mutect_vcf)
head(vcf_read_table[,c("V9","V10")])
                                                      V9                                               V10
1 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1       0/1:523,9:0.017:7:2:0.222:15571,247:261:262
2 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1       0/1:482,7:0.014:5:2:0.714:13932,203:257:225
3 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:249,239:0.504:109:130:0.544:7351,6930:123:126
4 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1       0/1:377,6:0.020:5:1:0.833:10997,186:176:201
5 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1         0/1:223,3:0.016:1:2:0.333:6697,93:122:101
6 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1         0/1:182,156:0.469:89:67:.:5449,4636:86:96

Which shows how the QSS field (second to last) is reported :/. I also try the SeqArray package and it also throw an error trying to read this field :/

At the moment I will tray to fix it by using read.table but is there a chance that this can be updated?

I actually also noted a similar problem when trying to read in Freebayes output. Freebayes reports several alternative alleles (if they are present) separated by commas. Only the first one is reported with VariantAnnotation.

Thanks!!

Anna

annaquaglieri16 commented 5 years ago

Hi again,

I guess this is the same problem as for the AD above which is parsed correctly.

head(VariantAnnotation::geno(vcf)$AD[,1])
$`chr1:36933772_C/T`
[1] 523   9

$`chr1:36935370_T/C`
[1] 482   7

$`chr1:36937059_A/G`
[1] 249 239

$`chr1:36939108_T/C`
[1] 377   6

$`chr1:36941395_G/C`
[1] 223   3

Cheers, Anna

vobencha commented 5 years ago

Hi @annaquaglieri16 , Thanks for the report. It will be awhile before I have a chance to look at this. Patches always welcome. Valerie

lawremi commented 5 years ago

I think the issue is that the QSS field is noted as Number=A in the header but it should be Number=R if the number of values is equal to the total number of alleles, not the number of alt alleles.

vobencha commented 5 years ago

@annaquaglieri16 have you looked into Michael's suggestion? Do you know why the QSS header is specified as Number=A instead of Number=R? This would indicate a problem with how the output was created not in how the data are read in. See section 1.2.2: https://samtools.github.io/hts-specs/VCFv4.2.pdf

annaquaglieri16 commented 5 years ago

Hi @vobencha ! Sorry I missed @lawremi 's other reply. I see, indeed in the VCF files produced by mutect, the AD field is read ok as it is specified as Number=R

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

but the QSS is not.

##FORMAT=<ID=QSS,Number=A,Type=Integer,Description="Sum of base quality scores for each allele">

Which I interpret it as ref and alt allele. I will try to mention this to the Mutect2 developers.

Thanks,

Anna

annaquaglieri16 commented 5 years ago

Well, actually, I figured they won't do anything on Mutect2 GATK3 but they will ask me to try the latest version GATK4 Mutect2 where all fields are different and qualities are specified differently.

So, I will keep my patch for GATK3 to read the two values correctly and add the option for GATK4 Mutect2 in my package.

Thanks for your help!

Anna

vobencha commented 5 years ago

Sounds good.