Open cs-wangchong opened 3 years ago
Thank you for this, and my apologies for taking so long to reply. I think your solution (sorting by length) sounds like a good idea. I want to run some tests first but it does sound like this will be an improvement.
Fix a bug
The bug is that Ronin may split the same identifier into different results due to the term order in the set `common_terms_with_numbers`.

Reproduction

I added `md5sum` to the set `common_terms_with_numbers` and then ran `ronin.split("md5sum")` several times. The splitting results were sometimes `["md5sum"]` and sometimes `["md5", "sum"]`.

Reason & Solution

I checked the code and found that the `heuristic_split` function in simple_splitters.py relies on the regular expression `_exceptions_re`. `_exceptions_re` is generated from `common_terms_with_numbers` without considering the order of the terms in the set. This means that if "md5" comes before "md5sum" in `_exceptions_re`, the split result is `["md5", "sum"]`; if "md5sum" comes before "md5" in `_exceptions_re`, the split result is `["md5sum"]`.

Solution: Sort the terms by term length when generating `_exceptions_re`.
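
The ordering effect can be reproduced outside Ronin with a minimal sketch. It assumes the pattern is built by joining the terms with `|`; the helper `build_exceptions_re` is hypothetical and is not Ronin's actual code. Python's `re` tries alternatives left to right and takes the first one that matches, so a shorter term listed first shadows a longer term that shares its prefix. Sorting is assumed here to mean longest first, since that is the order that keeps `md5sum` intact. The run-to-run variation is probably due to set iteration order for strings, which can differ between interpreter runs because string hashing is randomized by default.

```python
import re


def build_exceptions_re(terms, sort_by_length=True):
    """Hypothetical helper: join terms into a single alternation pattern."""
    ordered = list(terms)
    if sort_by_length:
        # Longest first; ties broken alphabetically so the pattern is deterministic.
        ordered = sorted(ordered, key=lambda t: (-len(t), t))
    return re.compile("|".join(re.escape(t) for t in ordered))


# Lists are used to force the two orders a set might happen to produce.
short_first = build_exceptions_re(["md5", "md5sum"], sort_by_length=False)
long_first = build_exceptions_re(["md5sum", "md5"], sort_by_length=False)

# With sorting, the input order no longer matters.
fixed = build_exceptions_re({"md5", "md5sum"})

print(short_first.match("md5sum").group())  # md5
print(long_first.match("md5sum").group())   # md5sum
print(fixed.match("md5sum").group())        # md5sum
```

With the sorted pattern, the longer term always gets the first chance to match, so the result is the same regardless of how the set happens to iterate.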