Illumina / strelka

Strelka2 germline and somatic small variant caller
GNU General Public License v3.0

Scripts to extract variant allele frequency (VAF) #234

Open maximus3219 opened 11 months ago

maximus3219 commented 11 months ago

Variant allele frequency (VAF), allele depth (AD), and depth (DP) are fundamental for interpreting NGS data, but unfortunately they are not readily available in the outputs from Strelka. If there is no plan to add them to the outputs, can you provide a bash script that extracts this information into separate columns, or that filters the variants directly on the values of VAF, AD and DP? bcftools can filter on such values directly when they are present in the INFO or FORMAT field, e.g. bcftools filter -i 'FORMAT/AF[1] > 0.05' input.vcf.gz

But unfortunately, extracting this information is extremely complicated, as stated in the manual:

refCounts = Value of FORMAT column $REF + "U" (e.g. if REF="A" then use the value in FORMAT/AU)
altCounts = Value of FORMAT column $ALT + "U" (e.g. if ALT="T" then use the value in FORMAT/TU)
tier1RefCounts = First comma-delimited value from $refCounts
tier1AltCounts = First comma-delimited value from $altCounts
Somatic allele frequency is $tier1AltCounts / ($tier1AltCounts + $tier1RefCounts)

How exactly can I implement the above pseudocode in a bash script with bcftools or other tools?

I have searched hundreds of webpages, and no one gives a solution or even discusses it!!

juliawiggeshoff commented 9 months ago

@maximus3219 I don't know if you are still interested, but I had the same problem last week. I wrote a Python script to calculate VAF for indels and SNVs from the somatic VCF files. I couldn't get it done with bcftools either, but here is the script. It calculates the VAF for each variant and includes this information for the normal and tumour samples in the final output VCF. Usage instructions are in the README.md
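
For readers who only need the SNV formula quoted from the manual earlier in this thread, here is a minimal Python sketch of that pseudocode. It is not the script mentioned in this comment. The function names are hypothetical, the example record is made up for illustration, and the code assumes a single ALT allele with the NORMAL sample in column 10 and the TUMOR sample in column 11 of the somatic SNV VCF. Indels use different FORMAT fields and are not covered.

```python
def tier1_count(format_keys, sample_field, base):
    """Return the tier1 count for `base` (A, C, G or T) from one sample column."""
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    # AU/CU/GU/TU each hold "tier1,tier2"; the first value is the tier1 count.
    return int(values[base + "U"].split(",")[0])


def somatic_snv_vaf(ref, alt, format_keys, sample_field):
    """tier1AltCounts / (tier1AltCounts + tier1RefCounts), or None if both are 0."""
    ref_count = tier1_count(format_keys, sample_field, ref)
    alt_count = tier1_count(format_keys, sample_field, alt)
    total = ref_count + alt_count
    return alt_count / total if total else None


def vaf_from_vcf_line(line):
    """Return (normal_vaf, tumour_vaf) for one tab-separated SNV record.

    Assumption: column 10 is NORMAL and column 11 is TUMOR.
    """
    cols = line.rstrip("\n").split("\t")
    ref, alt, format_keys = cols[3], cols[4], cols[8]
    normal, tumour = cols[9], cols[10]
    return (
        somatic_snv_vaf(ref, alt, format_keys, normal),
        somatic_snv_vaf(ref, alt, format_keys, tumour),
    )


# Made-up record for illustration only (not real Strelka output).
example = "\t".join([
    "chr1", "1000", ".", "A", "T", ".", "PASS", "SOMATIC",
    "DP:FDP:SDP:SUBDP:AU:CU:GU:TU",
    "30:0:0:0:30,30:0,0:0,0:0,0",    # NORMAL: 30 tier1 A reads, 0 tier1 T reads
    "40:0:0:0:30,31:0,0:0,0:10,10",  # TUMOR: 30 tier1 A reads, 10 tier1 T reads
])
print(vaf_from_vcf_line(example))
```

To apply this to a whole file, one could loop over the lines of an uncompressed VCF, skip those starting with "#", and call vaf_from_vcf_line on the rest. Records with no tier1 reads for either allele return None rather than raising a division error.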