freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License
143 stars 24 forks source link

Is it possible to resolve multi-nucleotide variants (MNV) to biallelic records? #77

Open rajwanir opened 1 month ago

rajwanir commented 1 month ago

Hi @freeseek

I notice that some MNV variants are often assayed using Infinium arrays. These MNV records are represented with a single record in manifest with a single allele but gets translated into multi-allelic in the gtc2vcf. My understanding is that a single probe can only interogate a single allele. May I please seek your help in understanding why and how these MNV records gets translated into multi-allelics?

For example:

CSV manifest record:

IlmnID Name IlmnStrand SNP AddressA_ID AlleleA_ProbeSeq AddressB_ID AlleleB_ProbeSeq GenomeBuild Chr MapInfo Ploidy Species Source SourceVersion SourceStrand SourceSeq TopGenomicSeq BeadSetID Exp_Clusters Intensity_Only RefStrand
5:112838101_MNV-0_B_R_2716756713 5:112838101_MNV BOT [T/C] 37787959 CTCTCCAAACTTCTATCTTTTTCAGAACGAGAACTATCTAAGCTTCCTCT     38 5 112838101 diploid Homo sapiens clinvar 0 BOT TCCGCGTTCTCTCTCCAAACTTCTATCTTTTTCAGAACGAGAACTATCTAAGCTTCCTCT[T/C]NNNAGGAGCTGGGTAACACTGTAGTATTCAAATATGGTGAAAGGACAGTCATGTTGCCAG CTGGCAACATGACTGTCCTTTCACCATATTTGAATACTACAGTGTTACCCAGCTCCTNNN[A/G]AGAGGAAGCTTAGATAGTTCTCGTTCTGAAAAAGATAGAAGTTTGGAGAGAGAACGCGGA 1984 3 0 -

VCF record:

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO

-- | -- | -- | -- | -- | -- | -- | -- chr5 | 112838101 | 5:112838101_MNV | C | A,G | . | . | GC=0.4125;ALLELE_A=1;ALLELE_B=2;FRAC_A=0.310924;FRAC_C=0.193277;FRAC_G=0.235294;FRAC_T=0.260504;NORM_ID=9;BEADSET_ID=1984;ASSAY_TYPE=0

From the chip: ftp://webdata@ussd-ftp.illumina.com/Public_Docs/Genotyping_Array_Support_Files/Global%20Screening%20Array/Global%20Screening%20Array%20v3/GSAv3%2BConfluence/NCI_custom_booster_20032937X371431_A1.csv

The behaviour of Illumina Dragen array gtc-to-vcf is identical (i.e. it also outputs multi-allelic variant in this scenerio).

Thank you.

freeseek commented 1 month ago

The probes assay an A/G SNP at the site, but the reference genome you are using contains a C at the same position. So the two probes are testing for two alleles and neither of these are the reference allele. Indeed, if you look at the INFO field, you have ALLELE_A=1;ALLELE_B=2. It would not be appropriate to split this record in two as you would lose the BAF and LRR information for this locus

rajwanir commented 1 month ago

Thanks so much. That was my guess too based on ALLELE_A=1;ALLELE_B=2 in the INFO but wanted to confirm. And the marker to spot such variants is that neither ALLELE_A nor ALLELE_B matches reference?

The issue is that in our internal pipeline we use plink/1.9 downstream which doesn't support multi-allelic records. I naively leaned towards splitting the record into bialllelic to retain both alleles. However, now that I understand it's truely assaying ALLELE_A vs ALLELE_B and not REF vs ALLELE_A/ALLELE_B, splitting or even the REF/ALT representation is incorrect in this case without taking into account the INFO field. Thanks for adding the ALLELE_A/ALLELE_B tag in the INFO field. Not sure what would be the right way to import these variants into plink/1.9 BED.

Thank you.

freeseek commented 1 month ago

If you only care about genotypes, you can split the variant in two bi-allelic variants. you will end up with two variants in perfect LD. Depending on your downstream workflow that might not be such a bad thing

rajwanir commented 1 month ago

Thanks a lot for the clarification.

rajwanir commented 1 month ago

Hi @freeseek

Curious if it may be appropriate and feasible for you to implement a special handling for ALLELE_A=1;ALLELE_B=2 case? Instead of setting REF to REF, set ALLELE_A to REF for these records? If it seems odd default behaviour then can be hopefully parameterized by a switch e.g.--upon-noallele-match-to-ref.

I see that this might break the REF/ALT convention in VCF but these records are pretty edge case anyway. Let me know what you think?

freeseek commented 1 month ago

It would be incorrect to set one of the two alleles to the reference allele and it would lead to mismatches when comparing to data generated through other technologies such as next generation sequencing. If this behavior is something that you feel strongly that you need, I would advise to write a separate plugin to post-process the BCFtools/gtc2vcf output VCF

rajwanir commented 1 month ago

Understandable. Thanks so much.