biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

Inconsistencies across intron/exon boundaries #655

Open cassiemk opened 1 year ago

cassiemk commented 1 year ago

We have a number of variants at the intron/exon or exon/intron boundary for which c_to_p returns no protein change (no var_p). We believe they should be treated as coding, because the splice site and splice region remain completely intact.

In [1]: hgvs_c = "NM_004380.2:c.3251-1dup"
In [2]: var_c = parse(hgvs_c)
In [3]: c_to_p(var_c)
Out[3]: SequenceVariant(ac=NP_004371.2, type=p, posedit=None, gene=None)

In [4]: hgvs_c = "NM_004380.2:c.3250_3250+1insT"
In [5]: var_c = parse(hgvs_c)
In [6]: c_to_p(var_c)
Out[6]: SequenceVariant(ac=NP_004371.2, type=p, posedit=None, gene=None)

In [7]: hgvs_c = "NM_004380.2:c.3251-1_3251insA"
In [8]: var_c = parse(hgvs_c)
In [9]: c_to_p(var_c)
Out[9]: SequenceVariant(ac=NP_004371.2, type=p, posedit=None, gene=None)

Other variants at the boundary, however, do return a protein change:

In [10]: hgvs_c = "NM_004380.2:c.3251dup"
In [11]: var_c = parse(hgvs_c)
In [12]: c_to_p(var_c)
Out[12]: SequenceVariant(ac=NP_004371.2, type=p, posedit=(Phe1085LeufsTer2), gene=None)

It seems the library decides whether a variant is coding based on the var_c nomenclature (the presence of +/-1 in this case) rather than on biology.
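The suspected offset-based decision can be illustrated without the library. The following is a minimal sketch, not the hgvs implementation: `has_intronic_offset` is a hypothetical helper that only inspects the c. string for a `+N`/`-N` offset on either end of the interval, which is the distinction the four examples appear to hinge on.

```python
import re

# One end of a c. interval: a base position (optionally in the 5' UTR "-" or
# 3' UTR "*") followed by an optional signed intronic offset such as "-1" or "+1".
_POSITION = re.compile(r"^(?P<base>[-*]?\d+)(?P<offset>[+-]\d+)?$")


def has_intronic_offset(hgvs_c: str) -> bool:
    """Return True if either end of the c. interval carries an intronic offset.

    Hypothetical helper for illustration only; it does not call hgvs.
    """
    _, sep, posedit = hgvs_c.partition(":c.")
    interval = re.match(r"[-+*\d_]+", posedit) if sep else None
    if interval is None:
        raise ValueError(f"not a c. variant: {hgvs_c!r}")
    for pos in interval.group(0).split("_"):
        m = _POSITION.match(pos)
        if m is None:
            raise ValueError(f"unrecognized position {pos!r} in {hgvs_c!r}")
        if m.group("offset"):
            return True
    return False


examples = [
    "NM_004380.2:c.3251-1dup",
    "NM_004380.2:c.3250_3250+1insT",
    "NM_004380.2:c.3251-1_3251insA",
    "NM_004380.2:c.3251dup",
]
for hgvs_c in examples:
    print(hgvs_c, has_intronic_offset(hgvs_c))
```

Under this sketch, the first three examples carry an offset and the fourth does not, which matches the split between `posedit=None` and a reported protein change in the sessions above.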

katiestahl commented 1 year ago

@cassiemk, @reece can likely explain this better, but I will give it a shot!

You are correct; the package does return no protein change for converted sequence variants based on the nomenclature when offsets are provided, like in your top 3 examples.

I believe this is working as designed, because we cannot guarantee that an intronic variant leaves every splice site or splice region intact.

I am unsure if there are plans to change this or add edge cases for specific variants where the coding regions are not affected. I will defer to Reece to comment on that.

gostachowiak commented 1 year ago

@katiestahl when there's an insertion right at the intron/exon boundary, there is a choice to make. Should the inserted material be treated as part of the coding region (because the canonical splice site, and in fact the entire intron, is intact) or as part of the intron (because it is adjacent to the canonical splice site)?

Currently, the behavior is inconsistent.

The four examples from the original issue are all insertions right at the boundary. Three of them are treated as intronic, and one is treated as CDS. The difference seems arbitrary: it depends only on whether the c. nomenclature includes an intronic position. So a conscious decision has not yet been made.
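The choice described above can be made concrete with a toy model. This is a sketch under stated assumptions, not hgvs code: the sequences, exon coordinates, and the `splice_with_insertion` function are invented for illustration. It inserts bases exactly at an exon boundary and splices the result under each interpretation.

```python
def splice_with_insertion(genomic, exons, pos, inserted, treat_as):
    """Insert `inserted` before 0-based index `pos`, then splice the exons.

    `exons` are (start, end) pairs, 0-based and end-exclusive. `treat_as`
    decides who owns an insertion that falls exactly on an exon boundary:
    "cds" assigns it to the exon, "intron" assigns it to the intron.
    """
    if treat_as not in ("cds", "intron"):
        raise ValueError("treat_as must be 'cds' or 'intron'")
    n = len(inserted)
    mutated = genomic[:pos] + inserted + genomic[pos:]
    new_exons = []
    for start, end in exons:
        if start < pos < end:        # strictly inside the exon
            end += n
        elif pos == start:           # acceptor side: intron | exon
            if treat_as == "cds":
                end += n
            else:
                start, end = start + n, end + n
        elif pos == end:             # donor side: exon | intron
            if treat_as == "cds":
                end += n
        elif pos < start:            # exon lies downstream of the insertion
            start, end = start + n, end + n
        new_exons.append((start, end))
    return "".join(mutated[s:e] for s, e in new_exons)


# Toy gene (made-up sequence): exon 1, a GT...AG intron, exon 2.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "TTTGAA"
genomic = exon1 + intron + exon2
exons = [(0, 6), (18, 24)]

as_cds = splice_with_insertion(genomic, exons, 18, "A", "cds")
as_intron = splice_with_insertion(genomic, exons, 18, "A", "intron")
print(as_cds)     # transcript gains the inserted base
print(as_intron)  # transcript is unchanged
```

In this model the genomic change is identical in both calls; only the ownership rule differs. Treating the insertion as CDS lengthens the transcript by one base (a frameshift), while treating it as intronic leaves the transcript unchanged.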

We have a developer updating the logic so that insertions at the boundary are treated as CDS, and we are planning a pull request soon, once all of our tests pass. This seems to be the more common choice, and it is the one our users seem to expect.

So the immediate task would be to see if we can come to alignment about which decision is most appropriate for insertions right at the boundary. As far as I can tell, HGVS (the society) doesn't have any guidance on this situation (they don't talk much about the right decisions to make for edge cases when projecting DNA changes onto transcripts).

Reasons we think these insertions should be treated as part of the coding region:

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 10 months ago

This issue was closed because it has been stalled for 7 days with no activity.

gostachowiak commented 10 months ago

Would it be possible to re-open this issue? It describes a real flaw, and there is a PR out to fix it.

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.