Closed byoo closed 1 month ago
Hi @byoo,
So first, let's talk about the allelic difference. Both *4.001
and *4.015
are *4
sub-alleles, meaning they will have the same downstream impact when provided to a tool like PharmCAT. In other words, they are functionally identical given our current database information. There is one only difference between these two sub-alleles, which is usually non-coding and non-splice impacting since they are sub-alleles (EDIT: I see you're aware of this in your IGV image):
chr22 42127356 rs28371729 G T . . .
Also, my data is phased by whatshap. I wonder if pb-StarPhase can use the phasing info from whatshap.
The VCF is actually not used at all for the CYP2D6 calling component because it is an unreliable source in the presence of duplications/hybrid alleles, so this will not impact the results.
I expect the proband to have one 4.015 allele, because the proband (and father) carries 3436C>A (rs28371729) which looks to be the only difference between 4.015 and *4.001.
I'd appreciate if you advise what to do to understand it better.
Based on the image you shared, I suspect this is correct. Without direct access to the data, or a full verbose log file, it will be difficult to determine what might be happening. I suspect that this is a case where you have two highly similar alleles (1 SNP difference), and the algorithm is incorrectly merging them into a single allele. We have examples internally where this does not happen, so I suspect it's either an unaccounted for edge case or some complication of the data that I only speculate on at this point.
If you want help digging into it more, we will (at a minimum) need to look at a verbose log file (run with -v
). It would be even more informative to have direct access to the dataset itself if that's possible.
Hope this helps! Matt
Hi Matt,
Thank you so much for your help! I have attached the verbose logs from the proband and its father just in case. I am pretty sure I can subset the BAMs and provide if needed. Please advise if BAMs would be helpful.
Thanks for sharing the logs, I think I see what's going on now. I'll try to explain below, by describing what the algorithm is doing and why:
[2024-07-16T15:10:11.004Z DEBUG pbstarphase::cyp2d6::caller] Typing consensus #2, 0 wildcards trimmed
...
[2024-07-16T15:10:11.141Z DEBUG pbstarphase::cyp2d6::haplotyper] Collapsed calls:
[2024-07-16T15:10:11.141Z DEBUG pbstarphase::cyp2d6::haplotyper] CYP2D6 at 0..6166: 0.00292=(18+0)/6165
[2024-07-16T15:10:11.141Z DEBUG pbstarphase::cyp2d6::haplotyper] Converting CYP2D6 to full allele definition...
[2024-07-16T15:10:11.143Z DEBUG pbstarphase::cyp2d6::haplotyper] WFAGraph result (0.001107802) => num_nodes: 1107, read_len: 6166, edit_distance: 1
[2024-07-16T15:10:11.143Z DEBUG pbstarphase::cyp2d6::haplotyper] CYP2D6*4.001 -> (143, 385), (0.9931, 0.9948)
[2024-07-16T15:10:11.144Z DEBUG pbstarphase::cyp2d6::caller] Reduced CYP2D6*4.001 to *4 for merging.
[2024-07-16T15:10:11.144Z DEBUG pbstarphase::cyp2d6::caller] Typing consensus #3, 0 wildcards trimmed
...
[2024-07-16T15:10:11.283Z DEBUG pbstarphase::cyp2d6::haplotyper] Collapsed calls:
[2024-07-16T15:10:11.283Z DEBUG pbstarphase::cyp2d6::haplotyper] CYP2D6 at 0..6167: 0.00308=(19+0)/6165
[2024-07-16T15:10:11.283Z DEBUG pbstarphase::cyp2d6::haplotyper] Converting CYP2D6 to full allele definition...
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::haplotyper] WFAGraph result (0.001300702) => num_nodes: 1107, read_len: 6167, edit_distance: 2
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::haplotyper] CYP2D6*4.001 -> (143, 385), (0.9931, 0.9948)
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::caller] Reduced CYP2D6*4.001 to *4 for merging.
[2024-07-16T15:10:11.286Z DEBUG pbstarphase::cyp2d6::caller] Typing consensus #4, 0 wildcards trimmed
...
[2024-07-16T15:10:11.426Z DEBUG pbstarphase::cyp2d6::haplotyper] Collapsed calls:
[2024-07-16T15:10:11.426Z DEBUG pbstarphase::cyp2d6::haplotyper] CYP2D6 at 0..6167: 0.00324=(20+0)/6165
[2024-07-16T15:10:11.426Z DEBUG pbstarphase::cyp2d6::haplotyper] Converting CYP2D6 to full allele definition...
[2024-07-16T15:10:11.429Z DEBUG pbstarphase::cyp2d6::haplotyper] WFAGraph result (0.001194202) => num_nodes: 1107, read_len: 6167, edit_distance: 2
[2024-07-16T15:10:11.429Z DEBUG pbstarphase::cyp2d6::haplotyper] CYP2D6*4.015 -> (143, 385), (0.9931, 0.9948)
[2024-07-16T15:10:11.429Z DEBUG pbstarphase::cyp2d6::caller] Reduced CYP2D6*4.015 to *4 for merging.
Short-term work-around: As I noted above, this won't impact downstream interpretation since those currently do not use sub-alleles, but you may encounter discrepancies like this in samples where you are comparing sub-alleles and the sample has two nearly identical, but different, sub-alleles in it. Internally, when we are benchmarking we are only using the core allele (e.g. *4), so it's possible we have missed some similar cases that you have here.
Longer-term fix: We may be able to use sub-alleles for the merging step instead of core-alleles. IIRC, we tried this before, but it created false positive calls in multiple datasets (i.e., it created errors in core diplotyping). We may be able to find some other workaround, but I'm not sure what that looks like. If you're willing to share a bamlet of the region containing CYP2D6 and CYP2D7, that may be beneficial for our testing of potential patches.
It makes sense and thank you so much for your kind help! I got an approval for sharing the bamlet. Would this region be large enough? chr22:42,125,000-42,150,000 Could you email me at byoo@cmh.edu?
Thanks, Byunggil
I think if you can extend the left side a little more, that should be good to capture most of what's necessary:
chr22:42,115,000-42,150,000
It may be worth running your StarPhase locally on the bamlet to verify you get the same answer. I'll reach out via email here in a bit.
Matt
I followed the instruction and sent the bamlet to your email. Please let me know if you didn't receive it. (and feel free to close the issue)
Between the dataset you provided and a couple other internal tests, we have a patch for this problem that will generate the correct diplotype for your dataset without injecting errors into our other test samples. This will come out with the next release.
Awesome! We will use the next release.
Version 0.12.0 should resolve this issue: https://github.com/PacificBiosciences/pb-StarPhase/releases/tag/v0.12.0
Hello,
I would like to have your help on understanding the CYP2D6 genotyping result that's seemingly incorrect. I ran pb-StarPhase for a trio and observed a mendelian error as follows.
proband: 4.001/4.001 mother: 2.001/4.001 father: 2.001/4.015
I expect the proband to have one 4.015 allele, because the proband (and father) carries 3436C>A (rs28371729) which looks to be the only difference between 4.015 and *4.001.
Also, my data is phased by whatshap. I wonder if pb-StarPhase can use the phasing info from whatshap.
pbstarphase 0.11.3-8e6128d log of the proband
Log from the father (pbstarphase 0.11.1 which I used originally)
I'd appreciate if you advise what to do to understand it better.
Thank you.
P.S. below is the IGV screen shot of rs28371729 differentiating 4.015 from 4.001