SACGF / cdot

Transcript versions for HGVS libraries
MIT License
29 stars 5 forks source link

HGVS to VCF Output for HGVS nomenclature with Duplication #67

Closed Mchhu closed 9 months ago

Mchhu commented 10 months ago

Using the example code to convert HGVS to VCF, when I put in a HGVS transcript with a duplication, the output for the ref and alt is incorrect. I have tested this with other HGVS nomenclature and this seems to only be affect variants that has a duplication. Below are a couple of examples with the input and outputs. I'm not sure what the problem is here.

Input: NM_021619.3:c.1071_1076dup Output: ('NC_000009.11', 133557022, 'C', 'C')

Input: NM_181523.3:c.1299+80_1425+64dup Output: ('NC_000005.9', 67589390, 'A', 'A')

import pyhgvs
from pysam.libcfaidx import FastaFile
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory

genome = FastaFile("/cdot/GCF_000001405.25_GRCh37.p13_genomic.fna")
factory = RESTPyHGVSTranscriptFactory()
pyhgvs.parse_hgvs_name('NM_021619.3:c.1071_1076dup', genome, get_transcript=factory.get_transcript_grch37)
davmlaw commented 9 months ago

Hi, this is due to a bug in pyhgvs, see https://github.com/counsyl/hgvs/issues/50

I made a pull request in March 2020 but it hasn't been accepted, there is no activity on the project and I think it has been abandoned. Using my code, I get:

In [4]: pyhgvs.parse_hgvs_name('NM_021619.3:c.1071_1076dup', genome, get_transcript=factory.get_transcript_grch37)
Out[4]: ('NC_000009.11', 133556992, 'T', 'TCGCCGC')

In [5]: pyhgvs.parse_hgvs_name('NM_181523.3:c.1299+80_1425+64dup', genome, get_transcript=factory.get_transcript_grch37)
Out[5]: 
('NC_000005.9',
 67589388,
 'A',
 'AGATGAATTACATGTAATCAAGTCTAAAAAACTTGACACTCGTAATTACATAATTGCAATTTTAAAGATGTTTCCATGTCAGCTATTTTGTTAAACAATTGTTATTTGATTAAATACCTTATCCATTGAATTTATTTTAATCTTTCTAGGATCAAGTTGTCAAAGAAGATAATATTGAAGCTGTAGGGAAAAAATTACATGAATATAACACTCAGTTTCAAGAAAAAAGTCGAGAATATGATAGATTATATGAAGAATATACCCGCACATCCCAGGTGAGTTTTCTATGAAAATCAGATTAAAAAATAAGAGTTCTAAACTTTTAAAGACTAACATG')

I think your options are: