Open rajwanir opened 1 month ago
The probes assay an A/G SNP at the site, but the reference genome you are using contains a C at the same position. So the two probes are testing for two alleles and neither of these are the reference allele. Indeed, if you look at the INFO field, you have ALLELE_A=1;ALLELE_B=2
. It would not be appropriate to split this record in two as you would lose the BAF
and LRR
information for this locus
Thanks so much. That was my guess too based on ALLELE_A=1;ALLELE_B=2
in the INFO but wanted to confirm. And the marker to spot such variants is that neither ALLELE_A nor ALLELE_B matches reference?
The issue is that in our internal pipeline we use plink/1.9 downstream which doesn't support multi-allelic records. I naively leaned towards splitting the record into bialllelic to retain both alleles. However, now that I understand it's truely assaying ALLELE_A vs ALLELE_B and not REF vs ALLELE_A/ALLELE_B, splitting or even the REF/ALT representation is incorrect in this case without taking into account the INFO field. Thanks for adding the ALLELE_A/ALLELE_B tag in the INFO field. Not sure what would be the right way to import these variants into plink/1.9 BED.
Thank you.
If you only care about genotypes, you can split the variant in two bi-allelic variants. you will end up with two variants in perfect LD. Depending on your downstream workflow that might not be such a bad thing
Thanks a lot for the clarification.
Hi @freeseek
Curious if it may be appropriate and feasible for you to implement a special handling for ALLELE_A=1;ALLELE_B=2
case? Instead of setting REF to REF, set ALLELE_A to REF for these records? If it seems odd default behaviour then can be hopefully parameterized by a switch e.g.--upon-noallele-match-to-ref
.
I see that this might break the REF/ALT convention in VCF but these records are pretty edge case anyway. Let me know what you think?
It would be incorrect to set one of the two alleles to the reference allele and it would lead to mismatches when comparing to data generated through other technologies such as next generation sequencing. If this behavior is something that you feel strongly that you need, I would advise to write a separate plugin to post-process the BCFtools/gtc2vcf output VCF
Understandable. Thanks so much.
Hi @freeseek
I notice that some MNV variants are often assayed using Infinium arrays. These MNV records are represented with a single record in manifest with a single allele but gets translated into multi-allelic in the gtc2vcf. My understanding is that a single probe can only interogate a single allele. May I please seek your help in understanding why and how these MNV records gets translated into multi-allelics?
For example:
CSV manifest record:
VCF record:
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO
-- | -- | -- | -- | -- | -- | -- | -- chr5 | 112838101 | 5:112838101_MNV | C | A,G | . | . | GC=0.4125;ALLELE_A=1;ALLELE_B=2;FRAC_A=0.310924;FRAC_C=0.193277;FRAC_G=0.235294;FRAC_T=0.260504;NORM_ID=9;BEADSET_ID=1984;ASSAY_TYPE=0
From the chip: ftp://webdata@ussd-ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/Global%20Screening%20Array/Global%20Screening%20Array%20v3/GSAv3%2BConfluence/NCI_custom_booster_20032937X371431_A1.csv
The behaviour of Illumina Dragen array gtc-to-vcf is identical (i.e. it also outputs multi-allelic variant in this scenerio).
Thank you.