genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License
102 stars 57 forks source link

Output a VAF for each variant called. #141

Closed jasonwalker80 closed 5 years ago

jasonwalker80 commented 7 years ago

VAFs are another output of the detect_variants workflow that are currently missing. These can be calculated from bam-readcount output, see #127. However, annotating the VCF with new VAFs has been a long-standing issue. There are GATK walker modules that could do this, but I don't think they will run outside of running HaplotypeCaller.

I think this issue still requires some discussion and investigation before development begins.

chrisamiller commented 7 years ago

Does every caller output a VAF-equivalent - AD, DP4, etc? (Some are calculated in different ways, I realize). Can we pass through those values, creating simple rules for resolving conflicts with a prioritized list of which ones we believe are the best?

jasonwalker80 commented 7 years ago

Here are the number of entries missing AF in the FORMAT fields:

     79 GT:AD
    487 GT:AD:DP:DP4:FREQ:RD
  23832 GT:AU:CU:DP:FDP:GU:SDP:SUBDP:TU
    528 GT:BCN50:DP:DP2:DP50:FDP50:SUBDP50:TAR:TIR:TOR
      1 GT:DP:DP4:FREQ:RD
jasonwalker80 commented 7 years ago

I tried running GATK VariantAnnotator, Coverage and DepthPerAlleleBySample modules, but I'm failing to see where or if it's actually adding anything to the VCF. I also tried replacing the sample names with real sample names that should match those used in the CRAM file.

jasonwalker80 commented 7 years ago

@tmooney suggested this: https://samtools.github.io/bcftools/howtos/plugin.fill-tags.html

ernfrid commented 7 years ago

@jasonwalker80 - I think that will use the AC and AN fields to give you the population genetics concept of allele frequency (number of alternate alleles in the genotype) instead of allele fraction.

jasonwalker80 commented 7 years ago

@ernfrid that is correct, in the docs it specifically says INFO tags, instead of the per-sample FORMAT fields, AF, AC, AN, etc.

chrisamiller commented 7 years ago

DP4 is a substitute for AF in some of those above fields (fwd and rev reads supporting the ref and var), but looks like more than one caller is missing that info as well. Sigh. How much money would we have to collect to get @ernfrid to add VCF support to bam-readcount?

chrisamiller commented 7 years ago

Also, do we know which callers are not outputting the fields we need?

jasonwalker80 commented 7 years ago

@tmooney is running the full exome with the latest features to retain each caller VCF. We should know by tomorrow which callers produce an allele frequency/fraction or fields that could be used to calcluate one. My initial impression from a small test data set is that pindel and strelka are missing them. FREQ is output by varscan and AF is output by mutect2.

chrisamiller commented 7 years ago

Pindel is probably a lost cause. Strelka does appear to be outputting the information, just in a weird format that would require some parsing to standardize. (AU, CU, GU, TU). See: https://sites.google.com/site/strelkasomaticvariantcaller/home/somatic-variant-output

jasonwalker80 commented 7 years ago

The nucleotide counts are only helpful for SNVs, we'd still need a solution for indels.

jasonwalker80 commented 7 years ago

We'd eventually re-implement the GMT:Analysis::Coverage::BamReacount logic by calculating a VAF from nucleotide counts. We need bam-readcount output for pVAC-Seq anyway, so that solution has some appeal. The number of GMT, genome, joinx dependencies will be large and could be a rabbit hole.

The other challenge is writing the AF FORMAT field back to an output VCF. You suggested a python library a while back(what was the name again?), I don't believe we've tested that to be sure it is capable of writing FORMAT fields for each sample.

chrisamiller commented 7 years ago

also, we're still using that ancient version of Pindel, right? The newer version has pindel2vcf, and from a quick glance, it appears to output some readcount info that may fit the bill: https://github.com/genome/pindel/blob/de4dedae967627f843b8d659f5c4589c798ac901/src/pindel2vcf.cpp

chrisamiller commented 7 years ago

(to be clear, I do not know whether pindel2vcf requires the newer version, or whether it will run on the older output)

chrisamiller commented 7 years ago

Oh, and this is the library that may or may not be useful. brentp is a good guy and will probably at least consider feature requests https://github.com/brentp/cyvcf2

ernfrid commented 7 years ago

I'll second that recommendation. @indraniel has had good luck with it.

jasonwalker80 commented 7 years ago

I'm starting to think we tie this in with #127. @susannasiebert we've talked about this before. Thoughts on adding a new docker-ized method getting VAFs into VCF files as part of the bam-readcount (pre-pVAC-Seq) workflow?

susannasiebert commented 7 years ago

That sound doable. I would rewrite the existing https://github.com/genome/arvados_trial/blob/master/pvacseq/bam-readcount.cwl into a python script that gets called inside of the cwl. I would need to play around with the cyvcf2 package a bit to see if it will do what we need it to.

jasonwalker80 commented 7 years ago

As discussed at the meeting this morning, the goal is to standardize the per sample FORMAT fields including the below fields from bam-readcount output. These fields are a subset of the Mutect2 FORMAT fields, the only ones missing are FOXOG, GQ, PGT, PID and PL. I can't think of a way to calculate values for those with bam-readcount alone.

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele fraction of the event in the tumor">
##FORMAT=<ID=ALT_F1R2,Number=1,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting the alternate allele">
##FORMAT=<ID=ALT_F2R1,Number=1,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting the alternate allele">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=QSS,Number=A,Type=Integer,Description="Sum of base quality scores for each allele">
##FORMAT=<ID=REF_F1R2,Number=1,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting the reference allele">
##FORMAT=<ID=REF_F2R1,Number=1,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting the reference allele">