[BUG] | Clashing META header definitions can cause `ALL_COLLATE` crash

G-kodes commented 3 years ago

Describe the bug

While debugging Issue #5, I found that, depending on the VCF files used as raw input, some INFO tags in the meta-information header share a name across files but not a definition. For example, in the 1000g dataset, the INFO tag AFR_AF (allele frequency in the AFR population) is defined as follows:

##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC an

While the same tag in the SAHGP dataset is defined as follows:

##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">

In this case, both datasets define the AFR_AF tag as a float (Type=Float). However, 1000g defines it as holding one value per alternate allele (Number=A), while SAHGP defines the same tag as holding exactly one value (Number=1). When one dataset contains an alternate allele that is not present in the other, such as 1000g containing an extra allele that SAHGP lacks, the bcftools merge command (ALL_COLLATE process) fails, citing an INFO tag of different lengths which cannot be merged.
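As a rough illustration of how such clashes could be detected before the merge, here is a minimal Python sketch that compares the Number and Type fields of INFO definitions from two headers. The function names (`info_definitions`, `clashing_tags`) are hypothetical and not part of the pipeline, and the 1000g Description is elided because only ID, Number and Type matter for the check.

```python
import re

# Matches the ID, Number and Type fields of a VCF ##INFO header line.
INFO_RE = re.compile(r'^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)')


def info_definitions(header_lines):
    """Map each INFO tag ID to its (Number, Type) pair."""
    defs = {}
    for line in header_lines:
        match = INFO_RE.match(line)
        if match:
            defs[match.group(1)] = (match.group(2), match.group(3))
    return defs


def clashing_tags(header_a, header_b):
    """Return tags defined in both headers with a different Number or Type."""
    a = info_definitions(header_a)
    b = info_definitions(header_b)
    return {tag: (a[tag], b[tag]) for tag in a.keys() & b.keys() if a[tag] != b[tag]}


# Description text is elided for 1000g; it plays no role in the comparison.
kg_header = ['##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="...">']
sahgp_header = [
    '##INFO=<ID=AFR_AF,Number=1,Type=Float,'
    'Description="Allele Frequency for samples from AFR based on AC/AN">'
]

print(clashing_tags(kg_header, sahgp_header))
# {'AFR_AF': (('A', 'Float'), ('1', 'Float'))}
```

A check like this could run on the raw inputs ahead of ALL_COLLATE, so that a clash is reported by tag name rather than surfacing as a merge failure.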

G-kodes commented 3 years ago

It seems the biggest issue here is the inconsistent use of INFO tags between datasets from multiple sources. While neither tag definition is overtly incorrect, the two ARE incompatible when merging in the above-mentioned scenario. This further highlights that INFO tags are unreliable in merge applications. I think the best solution going forward is to remove all INFO tags for this reason.

We do not use INFO tags in the current pipeline so the information contained therein is redundant.
Their content can always be re-generated as needed so removing them will not incur data loss.

G-kodes commented 3 years ago

By using the vcftools --recode flag, we can rewrite the files and invalidate the INFO tags without having to explicitly exclude every tag by name. Unfortunately, this does not remove the tag definitions from the header, which will cause some issues downstream.
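To make the leftover-definition problem concrete, the following Python sketch lists INFO IDs that are still declared in the header but no longer appear in any record. It is an illustration only: `orphaned_info_definitions` is a hypothetical helper, and the sample lines are made up for the example.

```python
def orphaned_info_definitions(vcf_lines):
    """Return INFO IDs declared in the header but absent from every record."""
    prefix = "##INFO=<ID="
    declared, used = set(), set()
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if line.startswith(prefix):
            # The ID is everything between the prefix and the first comma.
            declared.add(line[len(prefix):].split(",", 1)[0])
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            info = fields[7] if len(fields) > 7 else "."
            if info != ".":
                # INFO entries are ';'-separated KEY=VALUE pairs or bare flags.
                used.update(entry.split("=", 1)[0] for entry in info.split(";"))
    return declared - used


# Made-up example: the record's INFO column is empty, the definition remains.
recoded = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="example">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
]

print(orphaned_info_definitions(recoded))
```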

This may be a good excuse to implement a full-scale standardization step before the LIFTOVER process. This would have to include:

- Strip all INFO tags (bcftools annotate can do this).
- Filter out complex variants we cannot yet analyze (gatk SelectVariants can do this; it is currently located in the LIFTOVER rule).
- Repair and validate the VCF header for downstream use (picard FixVcfHeader can do this; it is also currently in the LIFTOVER process).
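The first step of the list above can be sketched in plain Python to show the intended result: both the `##INFO` header definitions and the per-record INFO values are removed. This is only an illustration of the effect, not the pipeline's implementation; in practice a tool such as bcftools annotate (for example with `-x INFO`) would do this, and `strip_info` and the sample lines are hypothetical.

```python
def strip_info(vcf_lines):
    """Drop ##INFO header definitions and blank the INFO column of each record."""
    cleaned = []
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if line.startswith("##INFO="):
            continue  # remove the tag definition from the header
        if line and not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) > 7:
                fields[7] = "."  # '.' marks an empty INFO column in VCF
            line = "\t".join(fields)
        cleaned.append(line)
    return cleaned


# Made-up input with one INFO definition and one annotated record.
sample = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="example">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG,T\t.\tPASS\tAFR_AF=0.1,0.2",
]

for line in strip_info(sample):
    print(line)
```

With both the definitions and the values gone, files from different sources no longer carry conflicting Number declarations into the merge.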

This approach would also greatly reduce code bloat downstream.

Tuks-ICMM / Pharmacogenetic-Analysis-Pipeline

[BUG] | Clashing META header definitions can cause `ALL_COLLATE` crash #6