SBIMB / StellarPGx

Calling star alleles in highly polymorphic pharmacogenes (e.g. CYP450 genes) by leveraging genome graph-based variant detection.
MIT License
30 stars 7 forks source link

CYP2D6 Alignment on Alt region #12

Closed NTNguyen13 closed 2 years ago

NTNguyen13 commented 2 years ago

Is your feature request related to a problem? Please describe. The current StellarPGx pipeline may not capture all CYP2D6 reads while the bam file is aligned to GRCh38.

Describe the solution you'd like Extend the capture region to Alt region.

Additional context From my understanding of StellarPGx, the first 3 steps call_snvs1, call_snvs2 and call_sv_del used the predefined coordinations from line 12 to 320 of main.nf. For CYP2D6, hg38, the coordination is chr22:42126000-42137500 and chr22:42126300-42132400, while these coordinations have covered the gene as well as its neighbor region, it still not comprehensive, because in Hg38, there are alternative regions (from genecards): chr22(ALT_REF_LOCI_2):48,826-53,137 (-) chr22(ALT_REF_LOCI_3):48,840-53,151 (-) chr22(PATCHES):4,240-8,551 (-)

Because some reads are aligned to Alt regions, the estimation of SV and some SNV may not be correct. I suggest that we should consider alt regions while calculating CYP2D6 diplotype (and some other genes if needed, too)

twesigomwedavid commented 2 years ago

@NTNguyen13, thanks for raising this. We are going to run tests regarding the reads aligning to alternate contigs and get back to you asap.

David

NTNguyen13 commented 2 years ago

Hi, may I ask about the progress of this issue? I'm not familiar with writing nextflow language, but I can try testing some approaches in bash script if you need my help.

twesigomwedavid commented 2 years ago

@NTNguyen13, thanks for the follow up.

I did some tests using short-read data. For hg38, the alternate contigs that comprise CYP2D6 and/or CYP2D7 are chr22_KB663609v1_alt:48676-53302 and chr22_KI270928v1_alt:46543-68556.

If you perform non-alt-aware read alignment (i.e. without the .alt index of hg38 -- Homo_sapiens_assembly38.fasta.64.alt), then more than half your reads would align to these alternate contig regions instead of the primary chr22 contig. However, if you have the .alt index of hg38 during alignment (i.e. perform alt-aware-alignment), then over 99.9% of all the reads align to the primary chr22 contig.

Therefore, the best solution for dealing with alternate contig alignments (if working with short-read data) is to have alignments generated using the alt-aware approach (which is the most popular). The CYP2D6 region in the alternate contigs above (and others) is different from that in the primary chr22 contig by only a few bases. Therefore with short-read data, only a small (and probably insignificant) fraction of reads align to these alternate regions when alt-aware alignment is performed. The key is to have the Homo_sapiens_assembly38.fasta.64.alt file as one of your indexes when performing alignments. This solution extends to other pharmacogenes.

That being said, the situation is different when it comes to long-read data. Since long reads are essentially full haplotypes, alt-aware alignment can favours alignment to alternate contigs – depending on the population(s) in your study – while non-alt-aware alignment could favour alignment to the primary chr22 contig (i.e. opposite for short read) for these genomes. We are in the process of adding support for long-read data to StellarPGx. Therefore we need to be conversant of these alternate contig alignments for this type of input. At present, the biggest challenge is that StellarPGx relies on GraphTyper to perform graph realignments, and the support for long-read SNV calling is still under development in GraphTyper.

NTNguyen13 commented 2 years ago

thank you for your update!