Open rajwanir opened 5 months ago
This would be redundant. You can compute that from ALLELE_A, ALLELE_B, and GT. What would be a use case that makes this required?
Agreed that it is a bit redundant and the tools should be able to infer that based on ALLELES A/B and GT. However, working with some of the older tools that were designed around processing Illumina files expects an Illumina ab genotype. For example as we were discussing in #68, the picard tools VCFtoAdpc much of it converts GT to illumina AB genotype.
I am now exploring switching to BAFRegress (https://genome.sph.umich.edu/wiki/BAFRegress) to do the contamination checks since it relies on text-based Illumina Reports and primarily needs BAF, population frequency from 1KGP (AF), and illumina ab genotype. I could annotate the VCF with AF from 1KGP and then pull BAF and AF from vcf with bcftools query
command. It would streamline things so much if illumina ab genotype tag was also encoded as format field and all relevant info can be directly pulled from a single bcftools query
command to feed to BAFRegress.
I see that there is a clear and simple logic to infer it based on GT and ALLELE_A, ALLELE_B but it would require additional processing before feeding the output from bcftools query
to BAFRegress.
If not a FORMAT tag by default, do you know of a bcftools command that should be able to translate GT and ALLELE_A, ALLELE_B to illumina genotype on stream? I could write a couple if else as a downstream processing but would prefer something from the bcftools as I think it might be more efficient. If have any suggestions or thoughts on reconstructing the illumina genotypes on the fly, please let me know.
Thank you so much.
I don't think you can do this straight from bcftools query
but a simple command like this could do it:
bcftools query -f "[%ALLELE_A\t%ALLELE_B\t%GT\n]" <input.vcf> | awk '{x[$1]="A"; x[$2]="B"; print x[substr($3,1,1)]x[substr($3,3,1)]}'
Hi @freeseek
Following up on #68, I was wondering if it may be feasible for you to include an Illumina_genotype tag in the FORMAT field? I see that based on ALLELE_A/ALLELE_B and GT the illumina_genotype can be infered. I will look up how I may be able to include this as a post-processing and then pass it on to the downstream tools that need it. But wanted to check if this is something might be straightforward to include in the standard output from gtc2vcf?
The illumina AB genotype could be either in the AA,AB,BB or NN format or it's numerical encoding as described here (https://github.com/broadinstitute/picard/blob/40bc2fd45ec3c0f438f2a4d31139c73dff792c82/src/main/java/picard/arrays/illumina/IlluminaGenotype.java#L35)
Thank you