Tuks-ICMM / Pharmacogenetic-Analysis-Pipeline

A Snakemake-powered pipeline developed to perform variant-effect prediction and frequency analysis given multiple Variant Call Format datasets. This has been developed in partial fulfilment of an MSc in Bioinformatics at the University of Pretoria by Graeme Ford.
https://tuks-icmm.github.io/Pharmacogenetic-Analysis-Pipeline/
Creative Commons Attribution 4.0 International

[BUG] | Clashing META header definitions can cause `ALL_COLLATE` crash #6

Open G-kodes opened 3 years ago

G-kodes commented 3 years ago

Describe the bug

While debugging Issue #5, I noticed that, depending on the VCF files used as raw input, some META INFO tags share a name across files but differ in their definition. For example, in the 1000g dataset, the INFO tag AFR_AF (allele frequency in the AFR population) is defined as follows:

##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC an

While the same tag in the SAHGP dataset is defined as follows:

##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">

In this case, both datasets define the AFR_AF tag as a float (Type=Float). However, 1000g defines it as holding one value per alternate allele (Number=A), while SAHGP defines the same tag as holding exactly one value (Number=1). When one dataset contains an alternate allele that is absent from the other, for example when 1000g lists an extra allele that SAHGP does not, the bcftools merge command (ALL_COLLATE process) fails, citing an INFO tag of different lengths which cannot be merged.
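One way to catch this kind of clash before the merge is to compare the ##INFO header definitions of the input files. The following is only a minimal sketch and not part of the pipeline: the function names `parse_info_defs` and `find_clashes` are hypothetical, the headers are passed in as plain lists of strings, and the 1000g Description text is a shortened placeholder because only Number and Type matter for the comparison.

```python
import re

# Matches a structured ##INFO header line and captures the text between < and >.
INFO_LINE = re.compile(r'^##INFO=<(.*)>\s*$')
# Splits key=value pairs; quoted values may themselves contain commas.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_info_defs(header_lines):
    """Return {tag ID: (Number, Type)} for every ##INFO line in a VCF header."""
    defs = {}
    for line in header_lines:
        match = INFO_LINE.match(line)
        if not match:
            continue  # not an INFO definition
        fields = {key: value.strip('"') for key, value in PAIR.findall(match.group(1))}
        if "ID" in fields:
            defs[fields["ID"]] = (fields.get("Number"), fields.get("Type"))
    return defs


def find_clashes(defs_a, defs_b):
    """Return tags defined in both headers whose Number or Type differ."""
    return {
        tag: (defs_a[tag], defs_b[tag])
        for tag in defs_a.keys() & defs_b.keys()
        if defs_a[tag] != defs_b[tag]
    }


# Illustrative headers: the 1000g Description is a placeholder, not the real text.
header_1000g = ['##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="placeholder">']
header_sahgp = ['##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">']

clashes = find_clashes(parse_info_defs(header_1000g), parse_info_defs(header_sahgp))
print(clashes)  # {'AFR_AF': (('A', 'Float'), ('1', 'Float'))}
```

A check like this only reports the mismatch; deciding which definition to keep, or whether to drop the tag, is a separate step.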

G-kodes commented 3 years ago

The biggest issue here seems to be that INFO tags are used inconsistently across datasets from different sources. Neither tag definition is overtly incorrect on its own, but the two ARE incompatible when merging in the above-mentioned scenario. This further highlights that INFO tags are unreliable in merge applications. For this reason, I think the best solution going forward is to remove all INFO tags.

G-kodes commented 3 years ago

By using the vcftools --recode flag, we can re-format the files and discard the INFO tag values without having to exhaustively exclude each tag by name. Unfortunately, this does not remove the tag definitions from the header, which will cause some issues downstream.
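To illustrate what a step that removes both the values and the definitions would need to do, here is a rough pure-Python sketch. It is a stand-in for illustration only, not the vcftools behaviour and not the pipeline's implementation: the helper name `strip_info` is hypothetical, and it assumes an uncompressed, tab-delimited VCF supplied as an iterable of text lines.

```python
def strip_info(vcf_lines):
    """Yield VCF lines with ##INFO definitions dropped and INFO values set to '.'."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO="):
            continue  # drop the tag definition from the header
        if line.startswith("#") or not line:
            yield line  # keep other header lines and blank lines unchanged
            continue
        cols = line.split("\t")
        if len(cols) >= 8:
            cols[7] = "."  # INFO is the 8th fixed column; '.' marks it as missing
        yield "\t".join(cols)


# Tiny illustrative input, not real data.
example = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\tAFR_AF=0.25",
]
cleaned = list(strip_info(example))
print("\n".join(cleaned))
```

In the cleaned output the ##INFO line is gone and the record's INFO column reads `.`, so no stale tag definition remains in the header.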

This may be a good excuse to implement a full-scale standardization step before the LIFTOVER process. This would have to include:

This approach would also greatly reduce code bloat downstream.