0xTCG / aldy

Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes
http://aldy.csail.mit.edu
Other
56 stars 20 forks source link

Discrepancies between Aldy results from BAM vs VCF files (hg19 reference genome) #74

Open paraveeown opened 6 months ago

paraveeown commented 6 months ago

Dear Aldy Team,

I analyzed targeted sequencing data of 100 PGx genes aligned to the hg19 reference genome using both BAM and VCF files in Aldy v4.4 (Python 3.9.7 on Linux 3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17). Discrepancies in star allele calls between BAM and VCF of the same sample were often understandable, attributed to the fact that alleles with CNVs could not be detected from VCF.

However, some differences could not be explained by this limitation.

For instance, in CYP2C19, sample that was called 1/1 from the BAM file got 38/38 from the VCF file.

BAM analysis: 1/1 image VCF analysis: 38/38 image

Based on PharmVar: 38 = wild-type (no variant) 1 = rs3758581 (A>G)

In the genome browser, the BAM file of this sample showed homozygous G at this position. Since G is the reference allele in hg19, it wasn't shown as a variant in the VCF file. image

The BAM file showed homozygous G, so it's logical that Aldy called 1/1 for this sample. However, I'm confused about why the VCF file, which also represented homozygous G, didn't produce the same result?

By the way, I noticed an interesting difference between reference genomes. In GRCh38, the G allele at this position is considered the alternate allele, making it a variant in VCF file, unlike in hg19. image

Could it be that the program expected the G allele of CYP2C19*1 to be recognized as a variant in the VCF file (like in GRCh38)? However, because it is a reference allele in hg19 and doesn't appear in the VCF, the program skipped it?

The VCF log file also indicates that the 1 allele was removed because rs3758581 wasn't present, resulting in it being called as 38/*38 instead. image


Another gene with a similar discrepancy is CYP3A7. Sample that was called 2/2 when using the BAM file got 1/1 from the VCF file.

BAM analysis: 2/2 image VCF analysis: 1/1 image

However, sample that was called 1/2 from the BAM file also got 1/2 from the VCF file. BAM analysis: 1/2 image VCF analysis: 1/2 image

This suggested that *2 might not be detected in VCF files only when it is homozygous?

Based on PharmVar: 1 = wild-type (no variant) 2 = rs2257401 (G>C on positive strand)

In the genome browser, sample 392 with 2/2 showed homozygous C at this position. Since C is the reference allele in hg19, it did not show up in the VCF. Sample 394 with 1/2 had both G and C at this position. Since G is the alternate allele in hg19, it was included in the VCF. image

Similar to the previously mentioned CYP2C19 SNP, the C allele at this position is the reference allele in hg19, but is an alternate allele in GRCh38. image

So, again, it seems that the program might anticipate the C allele of CYP3A72 to be in the VCF file (like in GRCh38), but because it is a reference allele in hg19 and was not present in the VCF, the program did not considered it and called 1 instead.

Can anyone please check if I've understood this correctly? Or are there other possible reasons for these differences in Aldy results from the BAM and VCF files?

Feel free to ask for more information, and thank you in advance.

Paravee

inumanag commented 4 months ago

Hi @paraveeown : thank you for this issue!

Would it be possible to email me these samples (you can subsample them to the problematic regions) so that I can take a closer look at what's happening?

paraveeown commented 4 months ago

Hello @inumanag. Thank you for responding! I'll prepare the data and send them to you soon.

paraveeown commented 4 months ago

Hello Ibrahim Numanagić,

I have prepared the samples mentioned in my issue as requested. I extracted the CYP2C19 and CYP3A7 regions only. Please download the data from this Google Drive link https://drive.google.com/drive/folders/1xGH86uHyAFFqnYjUVIcUT8O1g-Qi-eZe?usp=sharing ( https://drive.google.com/drive/folders/1xGH86uHyAFFqnYjUVIcUT8O1g-Qi-eZe?usp=sharing ).

Here is a summary of the results obtained from Aldy using these samples: [image: image.png]

Feel free to ask for more information if needed. Thank you in advance for your assistance.

Best regards, Paravee

On Wed, 15 May 2024 at 04:08, Ibrahim Numanagić @.***> wrote:

Hi @paraveeown https://github.com/paraveeown : thank you for this issue!

Would it be possible to email me these samples (you can subsample them to the problematic regions) so that I can take a closer look at what's happening?

— Reply to this email directly, view it on GitHub https://github.com/0xTCG/aldy/issues/74#issuecomment-2111145538, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOQ3MC27F7X4R4R36N5HBNLZCJ4ONAVCNFSM6AAAAABF2O7K4WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJRGE2DKNJTHA . You are receiving this because you were mentioned.Message ID: @.***>