biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

p_to_* reverse translation methods #265

Closed reece closed 7 months ago

reece commented 9 years ago

Originally reported by: Brian Craft (Bitbucket: briancraft, GitHub: Unknown)


Mapping protein to other coords would be useful.


reece commented 7 years ago

Problem statement: Given a p. variant, return a list of c. variants that translate to that p. variant.

Most p. variants are consistent with a very large number of c. variants of varying complexity. For example, a p. variant at AA position 1 might be consistent with an SNV c. variant at position 1, 2, or 3, or with a multi-nucleotide variant that changes 2 or 3 nt of that codon. It is also consistent with a very large set of indels that span that region. In the most diabolical cases, a c. variant might (in principle) be predicted to alter splicing to produce the specified p. variant.
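To make the substitution case concrete, here is a minimal sketch that lists the codons encoding a target amino acid, grouped by how many nucleotides differ from the reference codon. It assumes the standard genetic code and ignores indels and splicing entirely; `codon_changes` is a hypothetical helper, not part of hgvs.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_changes(ref_codon, alt_aa):
    """Hypothetical helper: codons for alt_aa, keyed by the number of
    nucleotides that differ from ref_codon (1 = SNV, 2 or 3 = MNV)."""
    by_ndiff = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != alt_aa:
            continue
        ndiff = sum(r != a for r, a in zip(ref_codon, codon))
        if ndiff:  # skip the reference codon itself
            by_ndiff[ndiff].append(codon)
    return dict(by_ndiff)


# Met (ATG) -> Leu: two codons are reachable by an SNV, four need a 2-nt change.
changes = codon_changes("ATG", "L")
for ndiff in sorted(changes):
    print(ndiff, changes[ndiff])
```

Even this restricted view returns several candidates for one substitution, before any indel is considered.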

Another issue is that users may want c. variants that are within a single exon (i.e., do not span exon-intron boundaries). This filtering might be better supported as a post-processing step.

Compound variants (i.e., distinct in-phase variants) create yet another kind of complexity.

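Finally, the combinatorial complexity of reverse translation grows quickly, even for small indels. Imagine an insertion of SAT. Counting only the four-fold degenerate codon families, the AAs might derive from S <= TC[ACGT], A <= GC[ACGT], T <= AC[ACGT], or 4 × 4 × 4 = 64 combinations; including serine's other two codons (AGT and AGC) raises that to 96.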
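The growth can be checked by brute force. The sketch below enumerates every DNA sequence that translates to a given peptide under the standard genetic code; `reverse_translations` is a hypothetical name used only for illustration. Against the full table, SAT yields 96 sequences rather than 64, because serine has six codons (TC[ACGT] plus AGT and AGC).

```python
from collections import defaultdict
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR[aa].append("".join(codon))


def reverse_translations(peptide):
    """Hypothetical helper: all DNA sequences encoding `peptide`
    (one-letter amino acid codes)."""
    choices = [CODONS_FOR[aa] for aa in peptide]
    return ["".join(codons) for codons in product(*choices)]


peptide = "SAT"
n_expected = prod(len(CODONS_FOR[aa]) for aa in peptide)  # 6 * 4 * 4
sequences = reverse_translations(peptide)
print(n_expected, len(sequences))
```

The count is the product of the per-residue codon counts, so it grows exponentially with the length of the inserted peptide.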

So, in order to implement this, we need to clearly define the problem we're solving (and therefore which problem classes we're excluding). A clearer set of requirements may imply parameters to the reverse-translation process that constrain the solution set (e.g., max_sub_len or max_ins_len), or a desire to use degenerate NTs to reduce combinatorial complexity.
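As one possible shape for such a constrained process, the sketch below collapses each amino acid's codons into IUPAC degenerate codons (grouping by the first two bases) and rejects inputs longer than a `max_ins_len` limit. Everything here is an assumption for illustration: the function names, the default limit, and the interpretation of `max_ins_len` as a nucleotide count are not an existing hgvs API.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes, keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def degenerate_codons(aa):
    """Collapse the codons for `aa` into one degenerate codon per 2-base prefix."""
    third_bases = defaultdict(set)
    for codon, encoded in CODON_TABLE.items():
        if encoded == aa:
            third_bases[codon[:2]].add(codon[2])
    return sorted(
        prefix + IUPAC["".join(sorted(bases))]
        for prefix, bases in third_bases.items()
    )


def degenerate_reverse_translate(peptide, max_ins_len=30):
    """Hypothetical: degenerate DNA sequences for `peptide`, refusing inputs
    whose nucleotide length exceeds max_ins_len (assumed to be in nt)."""
    if 3 * len(peptide) > max_ins_len:
        raise ValueError(f"{peptide!r} exceeds max_ins_len={max_ins_len} nt")
    choices = [degenerate_codons(aa) for aa in peptide]
    return ["".join(codons) for codons in product(*choices)]


print(degenerate_reverse_translate("SAT"))
```

For SAT this reduces 96 explicit sequences to two degenerate ones. Grouping by prefix keeps each degenerate codon exact (it never matches a codon for a different amino acid), at the cost of not always finding the smallest possible set.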

wlymanambry commented 1 year ago

Is there any way to get the corresponding genomic positions for a given p. variant? I understand your point above about the complexity of the c. equivalent, but couldn't we at least capture the potential genomic positions affected?

I realize it could be ambiguous but it could be isolated to a specific range.
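For the simple case, the range in question is just the codon: amino acid position n corresponds to CDS positions 3n-2 through 3n. A minimal sketch of that arithmetic follows; `codon_c_interval` and the accession `NM_000000.0` are placeholders for illustration, not hgvs functions or a real transcript.

```python
def codon_c_range(aa_pos):
    """Return the (start, end) CDS positions of the codon for a 1-based
    amino acid position."""
    if aa_pos < 1:
        raise ValueError("amino acid positions are 1-based")
    return 3 * aa_pos - 2, 3 * aa_pos


def codon_c_interval(transcript_ac, aa_pos):
    """Hypothetical helper: format the codon as an HGVS-style c. interval."""
    start, end = codon_c_range(aa_pos)
    return f"{transcript_ac}:c.{start}_{end}"


# "NM_000000.0" is a placeholder accession, not a real transcript.
print(codon_c_interval("NM_000000.0", 600))
```

Projecting that c. interval onto the genome would still need the transcript alignment (for instance via the `c_to_g` method on the hgvs mappers), and a codon that straddles an exon boundary would map to two genomic segments separated by an intron. Frameshifts and extensions affect more than one codon, so this only bounds simple substitutions.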

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been stalled for 7 days with no activity.