CYP2D6 genotyping issue - sub-allele collapse

byoo commented 1 month ago

Hello,

I would like to have your help on understanding the CYP2D6 genotyping result that's seemingly incorrect. I ran pb-StarPhase for a trio and observed a mendelian error as follows.

proband: 4.001/4.001 mother: 2.001/4.001 father: 2.001/4.015

I expect the proband to have one 4.015 allele, because the proband (and father) carries 3436C>A (rs28371729) which looks to be the only difference between 4.015 and *4.001.

Also, my data is phased by whatshap. I wonder if pb-StarPhase can use the phasing info from whatshap.

pbstarphase 0.11.3-8e6128d log of the proband

[2024-07-16T15:10:13.125Z DEBUG pbstarphase::cyp2d6::caller] Best_results:
[2024-07-16T15:10:13.125Z DEBUG pbstarphase::cyp2d6::caller]    5_REP6 -> 2_CYP2D6*4.001 -> 3_link_region -> 4_REP7 -> 6_spacer -> 0_CYP2D7
[2024-07-16T15:10:13.125Z DEBUG pbstarphase::cyp2d6::caller]    5_REP6 -> 2_CYP2D6*4.001 -> 3_link_region -> 4_REP7 -> 6_spacer -> 1_CYP2D7
[2024-07-16T15:10:13.125Z DEBUG pbstarphase::cyp2d6::caller] Full diplotype for CYP2D6 => "*4.001/*4.001"

Log from the father (pbstarphase 0.11.1 which I used originally)

[2024-07-15T23:14:07.887Z DEBUG pbstarphase::cyp2d6::caller] Best_results:
[2024-07-15T23:14:07.887Z DEBUG pbstarphase::cyp2d6::caller]    7_REP6 -> 2_CYP2D6*2.001 -> 4_link_region -> 5_REP7 -> 10_spacer -> 0_CYP2D7
[2024-07-15T23:14:07.887Z DEBUG pbstarphase::cyp2d6::caller]    8_REP6 -> 1_CYP2D6*4.015 -> 3_link_region -> 6_REP7 -> 9_spacer -> 0_CYP2D7
[2024-07-15T23:14:07.887Z DEBUG pbstarphase::cyp2d6::caller] Full diplotype for CYP2D6 => "*2.001/*4.015"

I'd appreciate if you advise what to do to understand it better.

Thank you.

P.S. below is the IGV screen shot of rs28371729 differentiating 4.015 from 4.001

holtjma commented 1 month ago

Hi @byoo,

So first, let's talk about the allelic difference. Both *4.001 and *4.015 are *4 sub-alleles, meaning they will have the same downstream impact when provided to a tool like PharmCAT. In other words, they are functionally identical given our current database information. There is one only difference between these two sub-alleles, which is usually non-coding and non-splice impacting since they are sub-alleles (EDIT: I see you're aware of this in your IGV image):

chr22   42127356    rs28371729  G   T   .   .   .

Also, my data is phased by whatshap. I wonder if pb-StarPhase can use the phasing info from whatshap.

The VCF is actually not used at all for the CYP2D6 calling component because it is an unreliable source in the presence of duplications/hybrid alleles, so this will not impact the results.

I expect the proband to have one 4.015 allele, because the proband (and father) carries 3436C>A (rs28371729) which looks to be the only difference between 4.015 and *4.001.

I'd appreciate if you advise what to do to understand it better.

Based on the image you shared, I suspect this is correct. Without direct access to the data, or a full verbose log file, it will be difficult to determine what might be happening. I suspect that this is a case where you have two highly similar alleles (1 SNP difference), and the algorithm is incorrectly merging them into a single allele. We have examples internally where this does not happen, so I suspect it's either an unaccounted for edge case or some complication of the data that I only speculate on at this point.

If you want help digging into it more, we will (at a minimum) need to look at a verbose log file (run with -v). It would be even more informative to have direct access to the dataset itself if that's possible.

Hope this helps! Matt

byoo commented 1 month ago

Hi Matt,

Thank you so much for your help! I have attached the verbose logs from the proband and its father just in case. I am pretty sure I can subset the BAMs and provide if needed. Please advise if BAMs would be helpful.

pbstarphase_father.log pbstarphase_proband.log

holtjma commented 1 month ago

Thanks for sharing the logs, I think I see what's going on now. I'll try to explain below, by describing what the algorithm is doing and why:

The algorithm's consensus step is initially finding three distinct versions of 4 (specifically 4.001, 4.001, and 4.015), snippets from the log I've copied below:

[2024-07-16T15:10:11.004Z DEBUG pbstarphase::cyp2d6::caller] Typing consensus #2, 0 wildcards trimmed
...
[2024-07-16T15:10:11.141Z DEBUG pbstarphase::cyp2d6::haplotyper] Collapsed calls:
[2024-07-16T15:10:11.141Z DEBUG pbstarphase::cyp2d6::haplotyper]    CYP2D6 at 0..6166: 0.00292=(18+0)/6165
[2024-07-16T15:10:11.141Z DEBUG pbstarphase::cyp2d6::haplotyper]    Converting CYP2D6 to full allele definition...
[2024-07-16T15:10:11.143Z DEBUG pbstarphase::cyp2d6::haplotyper]     WFAGraph result (0.001107802) => num_nodes: 1107, read_len: 6166, edit_distance: 1
[2024-07-16T15:10:11.143Z DEBUG pbstarphase::cyp2d6::haplotyper]     CYP2D6*4.001 -> (143, 385), (0.9931, 0.9948)
[2024-07-16T15:10:11.144Z DEBUG pbstarphase::cyp2d6::caller] Reduced CYP2D6*4.001 to *4 for merging.
[2024-07-16T15:10:11.144Z DEBUG pbstarphase::cyp2d6::caller] Typing consensus #3, 0 wildcards trimmed
...
[2024-07-16T15:10:11.283Z DEBUG pbstarphase::cyp2d6::haplotyper] Collapsed calls:
[2024-07-16T15:10:11.283Z DEBUG pbstarphase::cyp2d6::haplotyper]    CYP2D6 at 0..6167: 0.00308=(19+0)/6165
[2024-07-16T15:10:11.283Z DEBUG pbstarphase::cyp2d6::haplotyper]    Converting CYP2D6 to full allele definition...
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::haplotyper]     WFAGraph result (0.001300702) => num_nodes: 1107, read_len: 6167, edit_distance: 2
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::haplotyper]     CYP2D6*4.001 -> (143, 385), (0.9931, 0.9948)
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::caller] Reduced CYP2D6*4.001 to *4 for merging.
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::caller] Typing consensus #4, 0 wildcards trimmed
...
[2024-07-16T15:10:11.426Z DEBUG pbstarphase::cyp2d6::haplotyper] Collapsed calls:
[2024-07-16T15:10:11.426Z DEBUG pbstarphase::cyp2d6::haplotyper]    CYP2D6 at 0..6167: 0.00324=(20+0)/6165
[2024-07-16T15:10:11.426Z DEBUG pbstarphase::cyp2d6::haplotyper]    Converting CYP2D6 to full allele definition...
[2024-07-16T15:10:11.429Z DEBUG pbstarphase::cyp2d6::haplotyper]     WFAGraph result (0.001194202) => num_nodes: 1107, read_len: 6167, edit_distance: 2
[2024-07-16T15:10:11.429Z DEBUG pbstarphase::cyp2d6::haplotyper]     CYP2D6*4.015 -> (143, 385), (0.9931, 0.9948)
[2024-07-16T15:10:11.429Z DEBUG pbstarphase::cyp2d6::caller] Reduced CYP2D6*4.015 to *4 for merging.

After the initial consensus, the algorithm will merge alleles that reduce to the same base allele (*4 in this case) if they also have the same homo-polymer compressed representation.
- The reason for this is to reduce false positive errors from indels. You can actually see this happening in the example above because the assemblies found two versions of *4.001 that (likely) differ by a single base indel.
- This is also where the error is occurring in this dataset. The REF sequence for the rs28371729 variant is "GGT" and the ALT sequence is "GTT"; both of which convert into the same homo-polymer sequence of just "GT". To the merging algorithm, these three alleles qualify for merging because they all have the same base allele type (*4) and they all have the same HPC representation.

Short-term work-around: As I noted above, this won't impact downstream interpretation since those currently do not use sub-alleles, but you may encounter discrepancies like this in samples where you are comparing sub-alleles and the sample has two nearly identical, but different, sub-alleles in it. Internally, when we are benchmarking we are only using the core allele (e.g. *4), so it's possible we have missed some similar cases that you have here.

Longer-term fix: We may be able to use sub-alleles for the merging step instead of core-alleles. IIRC, we tried this before, but it created false positive calls in multiple datasets (i.e., it created errors in core diplotyping). We may be able to find some other workaround, but I'm not sure what that looks like. If you're willing to share a bamlet of the region containing CYP2D6 and CYP2D7, that may be beneficial for our testing of potential patches.

byoo commented 1 month ago

It makes sense and thank you so much for your kind help! I got an approval for sharing the bamlet. Would this region be large enough? chr22:42,125,000-42,150,000 Could you email me at byoo@cmh.edu?

Thanks, Byunggil

holtjma commented 1 month ago

I think if you can extend the left side a little more, that should be good to capture most of what's necessary:

chr22:42,115,000-42,150,000

It may be worth running your StarPhase locally on the bamlet to verify you get the same answer. I'll reach out via email here in a bit.

Matt

byoo commented 1 month ago

I followed the instruction and sent the bamlet to your email. Please let me know if you didn't receive it. (and feel free to close the issue)

holtjma commented 1 month ago

Between the dataset you provided and a couple other internal tests, we have a patch for this problem that will generate the correct diplotype for your dataset without injecting errors into our other test samples. This will come out with the next release.

byoo commented 1 month ago

Awesome! We will use the next release.

holtjma commented 1 month ago

Version 0.12.0 should resolve this issue: https://github.com/PacificBiosciences/pb-StarPhase/releases/tag/v0.12.0

PacificBiosciences / pb-StarPhase

CYP2D6 genotyping issue - sub-allele collapse #21