freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License
143 stars 24 forks source link

Add an explicit FORMAT/TAG for illumina genotype? #69

Open rajwanir opened 5 months ago

rajwanir commented 5 months ago

Hi @freeseek

Following up on #68, I was wondering if it may be feasible for you to include an Illumina_genotype tag in the FORMAT field? I see that based on ALLELE_A/ALLELE_B and GT the illumina_genotype can be infered. I will look up how I may be able to include this as a post-processing and then pass it on to the downstream tools that need it. But wanted to check if this is something might be straightforward to include in the standard output from gtc2vcf?

The illumina AB genotype could be either in the AA,AB,BB or NN format or it's numerical encoding as described here (https://github.com/broadinstitute/picard/blob/40bc2fd45ec3c0f438f2a4d31139c73dff792c82/src/main/java/picard/arrays/illumina/IlluminaGenotype.java#L35)

Thank you

freeseek commented 5 months ago

This would be redundant. You can compute that from ALLELE_A, ALLELE_B, and GT. What would be a use case that makes this required?

rajwanir commented 5 months ago

Agreed that it is a bit redundant and the tools should be able to infer that based on ALLELES A/B and GT. However, working with some of the older tools that were designed around processing Illumina files expects an Illumina ab genotype. For example as we were discussing in #68, the picard tools VCFtoAdpc much of it converts GT to illumina AB genotype.

I am now exploring switching to BAFRegress (https://genome.sph.umich.edu/wiki/BAFRegress) to do the contamination checks since it relies on text-based Illumina Reports and primarily needs BAF, population frequency from 1KGP (AF), and illumina ab genotype. I could annotate the VCF with AF from 1KGP and then pull BAF and AF from vcf with bcftools query command. It would streamline things so much if illumina ab genotype tag was also encoded as format field and all relevant info can be directly pulled from a single bcftools query command to feed to BAFRegress.

I see that there is a clear and simple logic to infer it based on GT and ALLELE_A, ALLELE_B but it would require additional processing before feeding the output from bcftools query to BAFRegress.

If not a FORMAT tag by default, do you know of a bcftools command that should be able to translate GT and ALLELE_A, ALLELE_B to illumina genotype on stream? I could write a couple if else as a downstream processing but would prefer something from the bcftools as I think it might be more efficient. If have any suggestions or thoughts on reconstructing the illumina genotypes on the fly, please let me know.

Thank you so much.

freeseek commented 5 months ago

I don't think you can do this straight from bcftools query but a simple command like this could do it:

bcftools query -f "[%ALLELE_A\t%ALLELE_B\t%GT\n]" <input.vcf> | awk '{x[$1]="A"; x[$2]="B"; print x[substr($3,1,1)]x[substr($3,3,1)]}'