Closed jasonwalker80 closed 5 years ago
Does every caller output a VAF-equivalent - AD, DP4, etc? (Some are calculated in different ways, I realize). Can we pass through those values, creating simple rules for resolving conflicts with a prioritized list of which ones we believe are the best?
Here are the number of entries missing AF in the FORMAT fields:
79 GT:AD
487 GT:AD:DP:DP4:FREQ:RD
23832 GT:AU:CU:DP:FDP:GU:SDP:SUBDP:TU
528 GT:BCN50:DP:DP2:DP50:FDP50:SUBDP50:TAR:TIR:TOR
1 GT:DP:DP4:FREQ:RD
I tried running GATK VariantAnnotator, Coverage and DepthPerAlleleBySample modules, but I'm failing to see where or if it's actually adding anything to the VCF. I also tried replacing the sample names with real sample names that should match those used in the CRAM file.
@tmooney suggested this: https://samtools.github.io/bcftools/howtos/plugin.fill-tags.html
@jasonwalker80 - I think that will use the AC and AN fields to give you the population genetics concept of allele frequency (number of alternate alleles in the genotype) instead of allele fraction.
@ernfrid that is correct, in the docs it specifically says INFO tags, instead of the per-sample FORMAT fields, AF, AC, AN, etc.
DP4 is a substitute for AF in some of those above fields (fwd and rev reads supporting the ref and var), but looks like more than one caller is missing that info as well. Sigh. How much money would we have to collect to get @ernfrid to add VCF support to bam-readcount?
Also, do we know which callers are not outputting the fields we need?
@tmooney is running the full exome with the latest features to retain each caller VCF. We should know by tomorrow which callers produce an allele frequency/fraction or fields that could be used to calcluate one. My initial impression from a small test data set is that pindel and strelka are missing them. FREQ
is output by varscan and AF
is output by mutect2.
Pindel is probably a lost cause. Strelka does appear to be outputting the information, just in a weird format that would require some parsing to standardize. (AU, CU, GU, TU). See: https://sites.google.com/site/strelkasomaticvariantcaller/home/somatic-variant-output
The nucleotide counts are only helpful for SNVs, we'd still need a solution for indels.
We'd eventually re-implement the GMT:Analysis::Coverage::BamReacount logic by calculating a VAF from nucleotide counts. We need bam-readcount output for pVAC-Seq anyway, so that solution has some appeal. The number of GMT, genome, joinx dependencies will be large and could be a rabbit hole.
The other challenge is writing the AF FORMAT field back to an output VCF. You suggested a python library a while back(what was the name again?), I don't believe we've tested that to be sure it is capable of writing FORMAT fields for each sample.
also, we're still using that ancient version of Pindel, right? The newer version has pindel2vcf, and from a quick glance, it appears to output some readcount info that may fit the bill: https://github.com/genome/pindel/blob/de4dedae967627f843b8d659f5c4589c798ac901/src/pindel2vcf.cpp
(to be clear, I do not know whether pindel2vcf requires the newer version, or whether it will run on the older output)
Oh, and this is the library that may or may not be useful. brentp is a good guy and will probably at least consider feature requests https://github.com/brentp/cyvcf2
I'll second that recommendation. @indraniel has had good luck with it.
I'm starting to think we tie this in with #127. @susannasiebert we've talked about this before. Thoughts on adding a new docker-ized method getting VAFs into VCF files as part of the bam-readcount (pre-pVAC-Seq) workflow?
That sound doable. I would rewrite the existing https://github.com/genome/arvados_trial/blob/master/pvacseq/bam-readcount.cwl into a python script that gets called inside of the cwl. I would need to play around with the cyvcf2 package a bit to see if it will do what we need it to.
As discussed at the meeting this morning, the goal is to standardize the per sample FORMAT fields including the below fields from bam-readcount output. These fields are a subset of the Mutect2 FORMAT fields, the only ones missing are FOXOG, GQ, PGT, PID and PL
. I can't think of a way to calculate values for those with bam-readcount alone.
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele fraction of the event in the tumor">
##FORMAT=<ID=ALT_F1R2,Number=1,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting the alternate allele">
##FORMAT=<ID=ALT_F2R1,Number=1,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting the alternate allele">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=QSS,Number=A,Type=Integer,Description="Sum of base quality scores for each allele">
##FORMAT=<ID=REF_F1R2,Number=1,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting the reference allele">
##FORMAT=<ID=REF_F2R1,Number=1,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting the reference allele">
VAFs are another output of the detect_variants workflow that are currently missing. These can be calculated from bam-readcount output, see #127. However, annotating the VCF with new VAFs has been a long-standing issue. There are GATK walker modules that could do this, but I don't think they will run outside of running HaplotypeCaller.
I think this issue still requires some discussion and investigation before development begins.