Issues
This PR is going to fix 4 problems:
- hgvs.trim_delseq_from_hgvs(). (Issue#116)
- re module. (Issue#139)
Fixes to Issue#116 and Issue#139
I've made a major refactoring of src/utils/hgvs.py and added a test suite, src/tests/utils/hgvs_test.py, to ensure correctness. Redundant regex matching is removed; compiled regex patterns are organized as class attributes of global classes.
Fix to Issue#136
I've made a major refactoring of src/hub/dataload/sources/dbsnp/dbsnp_json_parser.py, where the parse_one_rec function is rewritten to eliminate the shallow copies of allele-specific document fields.
Fix to Issue#137
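This fix requires a thorough understanding of the trim_delseq_from_hgvs() function (renamed to prune_redundant_seq() in this PR).
Its first commit had 2 problems:
- The function name suggested it would only remove the sequence between "del" and "ins" in a (legacy) delins HGVS. However, it also removed the trailing sequence if the input HGVS was a (legacy) ins, del or dup.
- It was WRONG to remove the trailing sequence for an ins HGVS. This function's aim is to transform a legacy HGVS with a redundant sequence into a shorter but valid HGVS. E.g. c.76_78delACT => c.76_78del (trailing ACT can be pruned), c.77_79dupCTG => c.77_79dup (trailing CTG can be pruned), and c.112_117delAGGTCAinsTG => c.112_117delinsTG (AGGTCA in the middle can be pruned). Removing the trailing sequence in an ins HGVS makes it INVALID.
Later, a fix made this function more complicated.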
Basically, this fix introduced a flag, remove_ins, to toggle the wrong behavior of re_ins.match(hgvs), which made the function even more difficult to understand.
This version of the function was later used to get the "prefix" of an HGVS ID when the ID is too long and encoding is needed (the word "prefix" has a special meaning in HGVS, so I switched to "stem" in this PR). It kind of worked, but to make the code clearer, it's better to create a new function for this purpose.
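To make the intended pruning rules concrete, here is a minimal sketch of them, not the code in this PR. The class name LegacyPatternsSketch, the function name prune_redundant_seq_sketch, and the exact regexes are assumptions made for illustration; the sketch only covers sequences written with the letters A, C, G, T and N. It also shows the idea of keeping compiled patterns as class attributes so that each pattern is compiled once.

```python
import re


class LegacyPatternsSketch:
    """Hypothetical holder of compiled patterns (class attributes, compiled once)."""

    # Legacy delins, e.g. c.112_117delAGGTCAinsTG: the deleted bases are redundant.
    DELINS = re.compile(r"^(?P<stem>.+?del)[ACGTN]+(?P<ins>ins[ACGTN]+)$")
    # Legacy del or dup, e.g. c.76_78delACT or c.77_79dupCTG: trailing bases are redundant.
    DEL_OR_DUP = re.compile(r"^(?P<stem>.+?(?:del|dup))[ACGTN]+$")


def prune_redundant_seq_sketch(hgvs: str) -> str:
    """Return a shorter but still valid HGVS; anything else is returned unchanged."""
    m = LegacyPatternsSketch.DELINS.match(hgvs)
    if m:
        # Drop only the bases between "del" and "ins"; keep the inserted bases.
        return m.group("stem") + m.group("ins")
    m = LegacyPatternsSketch.DEL_OR_DUP.match(hgvs)
    if m:
        # Drop the trailing bases after "del" or "dup".
        return m.group("stem")
    # An ins HGVS (and any unrecognized input) is left as-is:
    # removing its inserted sequence would make it invalid.
    return hgvs


print(prune_redundant_seq_sketch("c.76_78delACT"))
print(prune_redundant_seq_sketch("c.112_117delAGGTCAinsTG"))
print(prune_redundant_seq_sketch("c.76_77insT"))
```

Note that there is no remove_ins style flag in this sketch: an ins HGVS simply never matches a pruning pattern.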
In this PR, the function trim_delseq_from_hgvs() is rewritten and renamed to prune_redundant_seq(); a new function, get_hgvs_stem(), is created as a helper for encoding long HGVS IDs. A stem is a partial HGVS ID without its trailing sequence. Stemming of long Repeated Sequences HGVS IDs is included in the new function get_hgvs_stem().
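As an illustration of what "stem" means here, the following is a rough sketch under my own assumptions, not the get_hgvs_stem() implementation from this PR. The name get_hgvs_stem_sketch and the regex are hypothetical, only A/C/G/T/N sequences are handled, and Repeated Sequences IDs are deliberately not covered.

```python
import re

# Hypothetical pattern: an ins, del, dup or delins followed by a trailing run of bases.
_STEM_SKETCH = re.compile(r"^(?P<stem>.+?(?:delins|ins|del|dup))[ACGTN]+$")


def get_hgvs_stem_sketch(hgvs: str) -> str:
    """Return the HGVS ID without its trailing sequence, or the input if none is found."""
    m = _STEM_SKETCH.match(hgvs)
    return m.group("stem") if m else hgvs


print(get_hgvs_stem_sketch("c.76_77insACGT"))
print(get_hgvs_stem_sketch("c.112_117delinsTG"))
```

Unlike pruning, a stem is not expected to be a valid HGVS on its own (c.76_77ins is not), which is why it is kept as a separate function in this sketch.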