Fixes to Issue#116, Issue#136, Issue#137, and Issue#139

Issues

This PR is going to fix 4 problems:

Redundant regex matching in hgvs.trim_delseq_from_hgvs(). (Issue#116)
Uncompiled regex patterns cause lookup overhead in re module. (Issue#139)
Data inconsistency in ~2000 dbsnp documents due to shallow copies. (Issue#136)
158 extremely long HGVS IDs in the form of Repeated Sequences will not be stored in MongoDB. (Issue#137)

Fixes to Issue#116 and Issue#139

I've made a major refactoring of src/utils/hgvs.py, and added a test suite src/tests/utils/hgvs_test.py to ensure the correctness. Redundant regex matching are removed; compiled regex patterns are organized as class attributes of global classes.

Fix to Issue#136

I've made a major refactoring of src/hub/dataload/sources/dbsnp/dbsnp_json_parser.py where the parse_one_rec function is rewritten to eliminate the shallow copies of allele-specific document fields.

Fix to Issue#137

This fix requires thorough understanding of the trim_delseq_from_hgvs() function. (Renamed to prune_redundant_seq in this PR.)

Its first commit was:

def trim_delseq_from_hgvs(hgvs):
    re_delins = re.compile("(.*del)[A-Z]+(ins.*)")
    re_ins = re.compile("(.*ins)[A-Z]+$")
    re_del = re.compile("(.*del)[A-Z]+$")
    re_dup = re.compile("(.*dup)[A-Z]+$")
    if re_delins.match(hgvs):
        hgvs = "".join(re_delins.match(hgvs).groups())
    elif re_ins.match(hgvs):
        hgvs = "".join(re_ins.match(hgvs).groups())
    elif re_del.match(hgvs):
        hgvs = "".join(re_del.match(hgvs).groups())
    elif re_dup.match(hgvs):
        hgvs = "".join(re_dup.match(hgvs).groups())

    return hgvs

which had 2 problems:

The function name suggested it would only remove the sequences between "del" and "ins" in a (legacy) delins HGVS. However, it also removed the tailing sequences if the input HGVS is a (legacy) ins, del or dup.
It was WRONG to remove the tailing sequences for an ins HGVS. This function's aim is to transform a legacy HGVS with redundant sequence into a shorter but valid HGVS. E.g. c.76_78delACT => c.76_78del (tailing ACT can be pruned), c.77_79dupCTG => c.77_79dup (tailing CTG can be pruned), and c.112_117delAGGTCAinsTG => c.112_117delinsTG (AGGTCA in the middle can be pruned). Removing the tailing sequence in an ins HGVS makes it INVALID.

Later, a fix made this function more complicated:

def trim_delseq_from_hgvs(hgvs, remove_ins=False):
    re_delins = re.compile("(.*del)[A-Z]+(ins.*)")
    re_ins = re.compile("(.*ins)[A-Z]+$")
    re_del = re.compile("(.*del)[A-Z]+$")
    re_dup = re.compile("(.*dup)[A-Z]+$")
    if re_delins.match(hgvs):
        hgvs = "".join(re_delins.match(hgvs).groups())
    elif remove_ins and re_ins.match(hgvs):
        hgvs = "".join(re_ins.match(hgvs).groups())
    elif re_del.match(hgvs):
        hgvs = "".join(re_del.match(hgvs).groups())
    elif re_dup.match(hgvs):
        hgvs = "".join(re_dup.match(hgvs).groups())

    return hgvs

Basically, this fix introduced a flag remove_ins to toggle the wrong behavior of re_ins.match(hgvs), which made it even more difficult to understand.

This version of function was later used to get the "prefix" (the word "prefix" has special meaning in HGVS, so I switched to use "stem" in this PR) of a HGVS ID when it's too long and encoding is needed. It kind of worked, but to make code clearer, it's better to create a new function for this purpose.

In this PR, function trim_delseq_from_hgvs() is rewritten and renamed to prune_redundant_seq(); a new function get_hgvs_stem() is created as a helper to encoding long HGVS IDs. A stem is a partial HGVS ID without its tailing sequence. Stemming of long Repeated Sequences HGVS IDs is included in the new function get_hgvs_stem().

biothings / myvariant.info