counsyl / hgvs

HGVS variant name parsing and generation
MIT License
171 stars 79 forks source link

HGVS / genome coordinate conversion does not account for cDNA alignment gaps #60

Open davmlaw opened 3 years ago

davmlaw commented 3 years ago

RefSeq transcript sequences can be different from the reference sequence (even if they agree with 1 build they can be different across builds). These sequences are aligned against the genome to produce exon coordinates in GFF releases.

This alignment can sometimes produce insertions / deletions (5-10% of transcripts), eg in the GFF file there is a “cDNA match” string that records the alignment, and has a “Gap” entry:

NC_000002.12    RefSeq  cDNA_match      73385758        73386192        431.411 +       .       ID=daa36283c6058f57b6347eb074291b21;Target=NM_015120.4 1 438 +;assembly_bases_aln=5003;assembly_bases_seq=5003;consensus_splices=44;exon_identity=0.999768;for_remapping=2;gap_count=1;identity=0.999768;idty=0.993151;matches=12925;num_ident=12925;num_mismatch=0;pct_coverage=99.9768;pct_coverage_hiqual=99.9768;pct_identity_gap=99.9768;pct_identity_ungap=100;product_coverage=1;rank=1;splices=44;weighted_identity=0.999771;Gap=M185 I3 M250

NM_015120.4 has cDNA_match Gap=M185 I3 M250 - meaning there was 185 bases matched, 3 bases inserted then back to matching. You can see how this affects PyHGVS conversion downstream from the gaps:

2:73385942 A>T: NM_015120.4(ALMS1):c.74A>T (correct) 2:73385943 A>T: NM_015120.4(ALMS1):c.75A>T (off by 3, VEP gives NM_015120.4:c.78A>T) 2:73385944 G>C: NM_015120.4(ALMS1):c.76G>C (off by 3, VEP gives NM_015120.4:c.79G>C)

davmlaw commented 3 years ago

Hi, I have implemented this myself in a fork at https://github.com/SACGF/hgvs

To resolve the gaps, pyHGVS transcripts now need to contain the cDNA_match information from the GFF files. I've made this data available for download (or you can produce your own) see https://github.com/counsyl/hgvs/issues/26#issuecomment-961629833

Instead of adding a separate code path to handle alignment gaps, I treat exons without gaps as cDNA matches with 100% alignment. This keeps all code running through the same path so should minimise future work.

The work was done on top of a lot of my existing changes (fixing other bugs etc) but if the project starts accepting pull requests again, please let me know and I can make a patch against master.