biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

require points or ranges as appropriate for edit type #279

Open reece opened 8 years ago

reece commented 8 years ago

Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)


The parser decomposes PosEdits into Positions and Edits.

Positions are modeled as Intervals with start and end. Point positions are converted to Intervals with start==end. The current grammar therefore allows point positions and ranges to be accepted interchangeably when it should not.

Two specific consequences of this design are that the grammar accepts SNVs with a range and insertions with a point position. Examples (both of which are incorrect):

>>> hp.parse_hgvs_variant('NM_004260.3:n.2338insC')
SequenceVariant(ac=NM_004260.3, type=n, posedit=2338insC)

>>> hp.parse_hgvs_variant("NM_001637.3:c.1582_1583G>A")
SequenceVariant(ac=NM_001637.3, type=c, posedit=1582_1583G>A)

Insertions should always require a range (e.g., 2338_2339) and substitutions should always require a point position.
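A rough sketch of the rule described above, written against plain strings rather than the hgvs grammar or object model. The names `REQUIRED_SHAPE` and `check_posedit` are hypothetical and exist only for illustration. The sketch handles simple integer positions only (no offsets, no uncertainty) and does not check that the two insertion positions are adjacent.

```python
import re

# Hypothetical rule table: the position shape each edit type requires.
REQUIRED_SHAPE = {
    "sub": "point",  # substitutions, e.g. 1582G>A
    "ins": "range",  # insertions, e.g. 2338_2339insC
}

# Simplified posedit pattern: integer start, optional integer end,
# then either an insertion or a single-base substitution.
_POSEDIT_RE = re.compile(
    r"^(?P<start>\d+)(?:_(?P<end>\d+))?(?P<edit>ins[ACGT]+|[ACGT]>[ACGT])$"
)


def check_posedit(posedit: str) -> bool:
    """Return True if the position shape matches what the edit type requires."""
    m = _POSEDIT_RE.match(posedit)
    if m is None:
        raise ValueError(f"unsupported posedit in this sketch: {posedit!r}")
    shape = "range" if m.group("end") is not None else "point"
    edit_type = "ins" if m.group("edit").startswith("ins") else "sub"
    return REQUIRED_SHAPE[edit_type] == shape


for pe in ("2338insC", "2338_2339insC", "1582_1583G>A", "1582G>A"):
    print(pe, check_posedit(pe))
```

Under these assumptions, the two examples from this report (`2338insC` and `1582_1583G>A`) are rejected, while their corrected forms are accepted.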


reece commented 7 years ago

Original comment by Reece Hart (Bitbucket: reece, GitHub: reece):


This is doable, but hard. The major challenge is that there are a large number of coordinate types (simple, base-offset with seq start datum, base-offset with cds start datum, base-offset with cds end datum), position types (range, interval), and uncertainty. Coordinate types are associated with the variant type (c, g, m, n, r, p), and position types are associated with the edit type (del, ins, etc.). Addressing this issue requires enumerating all combinations.

We should tackle this, but there are enough big changes in 0.5.0 currently, and this would be a fairly big change. Let's defer to a future release.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

davmlaw commented 6 months ago

These currently fail validation:

validate(hp.parse_hgvs_variant('NM_004260.3:n.2338insC'))
HGVSInvalidVariantError: insertion length must be 1

validate(hp.parse_hgvs_variant("NM_001637.3:c.1582_1583G>A"))
HGVSInvalidVariantError: NM_001637.3:c.1582_1583G>A: Variant reference (G) does not agree with reference sequence (GG)

Are we OK with this check being done in validate() rather than in the parser? If so, we can just close this issue.

Or, if it needs to be fixed in the parser, what should we do?

davmlaw commented 3 months ago

I can probably fix this if we can agree on where the check should live (see my questions above).