Open davmlaw opened 3 years ago
Hi, I have implemented this myself in a fork at https://github.com/SACGF/hgvs
To resolve the gaps, pyHGVS transcripts now need to contain the cDNA_match information from the GFF files. I've made this data available for download (or you can produce your own); see https://github.com/counsyl/hgvs/issues/26#issuecomment-961629833
Instead of adding a separate code path to handle alignment gaps, I treat exons without gaps as cDNA matches with 100% alignment. This keeps all code running through the same path so should minimise future work.
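A minimal sketch of that idea, not the actual code in the fork: an exon with no recorded gap is given a single full-length match operation, so downstream code only ever sees cDNA matches. The function name `cdna_match_for_exon`, the tuple layout, and the 0-based half-open coordinates are assumptions made for illustration.

```python
def cdna_match_for_exon(start, end, gap=None):
    """Return (start, end, gap) for one exon.

    Hypothetical helper: coordinates are assumed to be 0-based, half-open
    genomic positions. An exon without alignment gaps gets a single
    full-length match such as 'M250', i.e. a 100% alignment.
    """
    if end <= start:
        raise ValueError("exon end must be greater than start")
    if gap is None:
        gap = "M%d" % (end - start)
    return (start, end, gap)


# An ungapped exon and a gapped one end up in the same representation.
ungapped = cdna_match_for_exon(100, 350)
gapped = cdna_match_for_exon(100, 535, gap="M185 I3 M250")
```

With every exon in this form, coordinate conversion needs only one code path that walks the gap operations.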
The work was done on top of a lot of my existing changes (fixing other bugs etc) but if the project starts accepting pull requests again, please let me know and I can make a patch against master.
RefSeq transcript sequences can differ from the reference genome sequence (even if they agree with one build, they can differ from another). These sequences are aligned against the genome to produce the exon coordinates in GFF releases.
This alignment can sometimes produce insertions / deletions (5-10% of transcripts). For example, in the GFF file there is a "cDNA match" string that records the alignment and has a "Gap" entry:
NM_015120.4 has cDNA_match Gap=M185 I3 M250, meaning 185 bases matched, then 3 bases were inserted, then 250 more bases matched. You can see how this affects PyHGVS conversion downstream from the gap:
2:73385942 A>T: NM_015120.4(ALMS1):c.74A>T (correct)
2:73385943 A>T: NM_015120.4(ALMS1):c.75A>T (off by 3, VEP gives NM_015120.4:c.78A>T)
2:73385944 G>C: NM_015120.4(ALMS1):c.76G>C (off by 3, VEP gives NM_015120.4:c.79G>C)
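The off-by-3 shift can be reproduced with a small sketch that walks the Gap operations. This is an illustration under stated assumptions, not pyHGVS code: the function names are hypothetical, offsets are 0-based and relative to the start of the aligned region, the transcript is assumed to be on the plus strand, and the operations are read as M consuming both sequences, I consuming transcript bases only, and D consuming genomic bases only.

```python
import re


def parse_gap(gap):
    """Split a Gap string such as 'M185 I3 M250' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        m = re.fullmatch(r"([MID])(\d+)", token)
        if m is None:
            raise ValueError("unsupported Gap operation: %r" % token)
        ops.append((m.group(1), int(m.group(2))))
    return ops


def genomic_to_transcript_offset(gap, genomic_offset):
    """Map a 0-based genomic offset to a 0-based transcript offset.

    Returns None when the genomic base falls inside a deletion and so has
    no counterpart in the transcript. Plus strand is assumed.
    """
    g = t = 0  # bases consumed so far in genome / transcript
    for op, length in parse_gap(gap):
        if op == "M":
            if genomic_offset < g + length:
                return t + (genomic_offset - g)
            g += length
            t += length
        elif op == "I":
            t += length  # extra transcript bases, absent from the genome
        else:  # "D"
            if genomic_offset < g + length:
                return None
            g += length  # genomic bases absent from the transcript
    raise ValueError("offset lies beyond the aligned region")


gap = "M185 I3 M250"
last_before_gap = genomic_to_transcript_offset(gap, 184)
first_after_gap = genomic_to_transcript_offset(gap, 185)
```

Here the last base of the first match block maps to the same offset (184), while the next genomic base maps to 188 rather than 185, which is the 3-base shift seen in the examples above.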