AstraZeneca-NGS / VarDict

VarDict
MIT License

AF in INFO field is misleading for paired samples #79

Open multimeric opened 6 years ago

multimeric commented 6 years ago

I noticed that the VCFs produced by VarDict include AF in the FORMAT as well as INFO sections: https://github.com/AstraZeneca-NGS/VarDict/blob/master/var2vcf_paired.pl#L50

I guess this is technically correct for single-sample VCFs, because the allele frequency at that site is the same as the allele frequency for the sample. But for paired VCFs this seems misleading: you are setting the INFO AF to the same value as the first sample's AF: https://github.com/AstraZeneca-NGS/VarDict/blob/master/var2vcf_paired.pl#L212 (AF=$af1;). However, putting it in the INFO field implies that this metric was calculated across all samples, when in fact it wasn't.

I suggest that you either:

a) Remove AF from the INFO field
b) Recalculate AF for paired samples, based on both of the two samples, instead of just the first
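
To make option (b) concrete, here is a minimal Python sketch of one way a site-level AF could be computed from both samples: pool the variant-supporting reads and the total depth. This is only an illustration, not VarDict's code. The function name `pooled_af` is hypothetical, and it assumes that the per-sample variant depth (VD) and total depth (DP) are available for both the tumor and the normal.

```python
def pooled_af(vd_tumor, dp_tumor, vd_normal, dp_normal):
    """Site-level allele frequency pooled over both samples.

    Hypothetical helper: assumes VD (variant-supporting reads) and
    DP (total depth) are known for the tumor and the normal sample.
    """
    total_depth = dp_tumor + dp_normal
    if total_depth == 0:
        # No coverage in either sample: report 0 rather than divide by zero.
        return 0.0
    return (vd_tumor + vd_normal) / total_depth


# Illustrative numbers only: 30/100 variant reads in the tumor, 0/100 in the normal.
tumor_af = 30 / 100
site_af = pooled_af(30, 100, 0, 100)
print(tumor_af, site_af)
```

With these made-up numbers the pooled value is half the tumor AF, which is the gap between "AF of the first sample" and "AF across all samples" that the INFO field currently hides.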

vladsavelyev commented 6 years ago

Hi Miguel,

I would argue that in paired calling it is not misleading, because you usually focus on the tumor data, which makes it reasonable to put tumor stats into INFO. The matched normal is there basically only as a control.

Do the AF values in INFO break anything for you downstream?

Vlad

multimeric commented 6 years ago

I suspect that INFO fields should only be for annotations on the position, e.g. gene name, variant type, variant end point, consequence score etc, since the VCF could be merged with another where a single AF for both samples doesn't make sense.

It doesn't exactly break anything downstream, but when I combine VarDict VCFs with VCFs from other callers (using GATK's CombineVariants), the result ends up with multiple AF values: some in the FORMAT field, and the VarDict one in the INFO field.

multimeric commented 6 years ago

@vladsaveliev thoughts?

vladsavelyev commented 6 years ago

I see your point that INFO should ideally only contain annotations that are specific to a genomic change, and agnostic to sample data, enabling merging of multiple VCFs.

The problem is that VarDict reports other data-specific fields in INFO, like DP, VD, SSF, STATUS, etc. These are often duplicated in FORMAT, but keeping them in INFO might be important for back-compatibility. On top of that, there are other types of data in the VCF that are specific to the tumor sample: for example, the FILTER column is also populated based on AF and coverage, and would become irrelevant after merging. Also, INFO even contains the standard SOMATIC flag, which can't be moved into FORMAT according to the specification.

I'd be hesitant to remove fields, for the sake of back-compatibility, but I'll leave it to the authors to decide. I think that when merging VCFs you will need to do quite a lot of careful cleanup anyway. Besides the issues mentioned above, variants (especially indels) can be represented in different ways, so normalization would be required (split multiallelics, split biallelic MNPs, left-align indels, etc.), which can break literally all annotations. So I'm hoping that having to clean up AF in INFO won't make much difference. The VCF format is painful, that's for sure :(
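
As a rough illustration of that cleanup step, here is a minimal Python sketch that drops one key (AF by default) from the INFO column of a VCF data line before merging. It is a plain-text sketch under stated assumptions, not part of VarDict or GATK: the function name `drop_info_key` and the example record are hypothetical, and it assumes tab-separated lines with INFO as the 8th column. It leaves header lines untouched, so the matching `##INFO` definition would still need to be removed separately.

```python
def drop_info_key(vcf_line, key="AF"):
    """Return the VCF line with `key` removed from the INFO column.

    Hypothetical helper: header lines (starting with '#') are returned
    unchanged, and FORMAT/sample columns are never modified.
    """
    if vcf_line.startswith("#"):
        return vcf_line
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    # INFO is the 8th column: semicolon-separated KEY=VALUE pairs or bare flags.
    entries = fields[7].split(";")
    kept = [e for e in entries if e.split("=", 1)[0] != key]
    # An empty INFO column is written as '.'.
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields) + newline


# Made-up record for illustration; values are not real VarDict output.
record = "chr1\t100\t.\tA\tT\t50\tPASS\tSTATUS=StrongSomatic;AF=0.25;DP=80\tGT:AF\t0/1:0.25\t0/0:0.0"
print(drop_info_key(record))
```

The per-sample AF values in the FORMAT columns are kept, so after merging only the sample-level AFs remain.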