Open G-kodes opened 3 years ago
It seems the biggest issue here is the lack of consistency of INFO tag usage between datasets from multiple sources. While neither tag definition is overtly incorrect in its usage, they ARE mutually exclusive in terms of merging in the above-mentioned scenario. This also further highlights that INFO tags are unreliable in merge applications. I think the best solution going forward is to remove all INFO tags for this reason.
By using the `vcftools --recode` flag, we can re-format the files and invalidate the INFO tags without having to explicitly exclude each and every tag by name in an exhaustive approach. Unfortunately, this does not remove the TAG definition, which will cause some issues downstream.
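The difference between clearing INFO values and also removing their definitions can be sketched in plain Python. This is only an illustration of the idea, assuming an uncompressed, tab-delimited VCF read as a list of lines; `strip_info`, the placeholder description, and the example record are hypothetical and are not part of the pipeline or of vcftools.

```python
def strip_info(vcf_lines, drop_header_definitions=True):
    """Blank the INFO column of every record; optionally drop ##INFO header lines."""
    out = []
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if line.startswith("##INFO="):
            # Header definition of an INFO tag. Keeping it reproduces the
            # leftover-definition situation described above.
            if not drop_header_definitions:
                out.append(line)
        elif line.startswith("#"):
            # Other meta lines and the #CHROM column header are kept as-is.
            out.append(line)
        else:
            fields = line.split("\t")
            fields[7] = "."  # INFO is the 8th fixed VCF column; "." means missing
            out.append("\t".join(fields))
    return out


# Hypothetical input: the Description text and the record are made up.
example = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="placeholder">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tC,T\t.\tPASS\tAFR_AF=0.1,0.2",
]

for line in strip_info(example):
    print(line)
```

With `drop_header_definitions=False` the record values are still blanked but the `##INFO=` line survives, which is the case that causes trouble downstream.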
This may be a good excuse to implement a full-scale standardization step before the LIFTOVER process. This would have to include:
- (`bcftools annotate` can do this)
- (`gatk SelectVariants` can do this; it is currently located in the LIFTOVER rule)
- (`picard FixVcfHeader` can do this; it is also currently in the LIFTOVER process)

This approach would also greatly reduce code bloat downstream.
Describe the bug

It has come to my attention while debugging for Issue #5 that, depending on the VCF files used as raw input, some META INFO tags in the files are shared in name but not in their definition. For example, in the 1000g dataset, the INFO tag AFR_AF (allele frequency in the AFR population) is defined as follows:
While the same tag in the SAHGP dataset is defined as follows:
In this case, they both define the AFR_AF tag as a float (`Type=Float`); however, 1000g defines it as containing one value per alternate allele (`Number=A`), while SAHGP defines the same tag as containing only one entry (`Number=1`). In some cases where one dataset contains an alternate allele which is not present in another, such as 1000g containing an extra allele and SAHGP not, this will cause the `bcftools merge` command (`ALL_COLLATE` process) to fail, citing an INFO tag of different lengths which cannot be merged.
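A check for this kind of clash could be run on the headers before merging. The following is a minimal sketch, assuming the `##INFO=` lines list `ID`, `Number` and `Type` in that order; the helper names, the placeholder descriptions and the extra `DP` tag are hypothetical, and only the `Number`/`Type` values for AFR_AF are taken from the description above.

```python
import re

# Assumes the usual ID, Number, Type ordering inside the ##INFO=<...> line.
INFO_RE = re.compile(r"^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)")


def info_definitions(header_lines):
    """Map each INFO tag ID to its (Number, Type) pair."""
    defs = {}
    for line in header_lines:
        match = INFO_RE.match(line)
        if match:
            defs[match.group(1)] = (match.group(2), match.group(3))
    return defs


def conflicting_tags(header_a, header_b):
    """Return {tag: (definition_a, definition_b)} for shared tags that disagree."""
    a = info_definitions(header_a)
    b = info_definitions(header_b)
    return {
        tag: (a[tag], b[tag])
        for tag in sorted(a.keys() & b.keys())
        if a[tag] != b[tag]
    }


# Illustrative headers: descriptions are placeholders, DP is a made-up extra tag.
kg_header = [
    '##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="placeholder">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="placeholder">',
]
sahgp_header = [
    '##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="placeholder">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="placeholder">',
]

print(conflicting_tags(kg_header, sahgp_header))
# {'AFR_AF': (('A', 'Float'), ('1', 'Float'))}
```

Any tag reported here is a candidate for removal or renaming before the merge, since a record with two alternate alleles carries two values under `Number=A` but only one under `Number=1`.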