casics / spiral

A Python 3 module that provides functions for splitting identifiers found in source code files.
GNU General Public License v3.0

Inconsistent results in heuristic_split #3

Open cs-wangchong opened 3 years ago

cs-wangchong commented 3 years ago

Fix a bug

The bug is that Ronin may split the same identifier differently from one run to the next, because the result depends on the order in which terms are read from the set common_terms_with_numbers.

Reproduction

I added "md5sum" to the set common_terms_with_numbers and then ran ronin.split("md5sum") several times. The result was sometimes ["md5sum"] and sometimes ["md5", "sum"].

Reason & Solution

I checked the code and found that the heuristic_split function in simple_splitters.py relies on the regular expression _exceptions_re. This expression is generated by joining the terms in common_terms_with_numbers into an alternation, without taking into account that the iteration order of a set is not guaranteed. As a result, if "md5" comes before "md5sum" in _exceptions_re, the split result is ["md5", "sum"]; if "md5sum" comes before "md5", the split result is ["md5sum"].
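
The order dependence described above can be reproduced without Spiral itself. The sketch below is only an illustration: build_exceptions_re is a hypothetical helper that mimics how the alternation is assembled, and the two explicit lists stand in for two possible iteration orders of the set.

```python
import re

def build_exceptions_re(terms):
    # Hypothetical helper: joins the terms in the order given,
    # the way an alternation built from an unordered set would.
    return re.compile(r'(' + '|'.join(terms) + ')', re.I)

# Two possible iteration orders of the same set of terms.
short_first = build_exceptions_re(["md5", "md5sum"])
long_first = build_exceptions_re(["md5sum", "md5"])

# Python's regex alternation tries alternatives left to right and
# stops at the first one that matches, so the order decides the result.
print(short_first.match("md5sum").group(0))  # md5
print(long_first.match("md5sum").group(0))   # md5sum
```

With "md5" first, the match stops after three characters and "sum" is left over to be split off; with "md5sum" first, the whole identifier is matched as one term.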

Solution: Sort the terms by length in descending order when generating _exceptions_re, so that longer terms are always tried before their shorter prefixes.

_exceptions_re = re.compile(r'(' + '|'.join(sorted(common_terms_with_numbers, key=lambda term: len(term), reverse=True)) + ')', re.I)

mhucka commented 2 years ago

Thank you for this, and my apologies for taking so long to reply. I think your solution (sorting by length) sounds like a good idea. I want to run some tests first but it does sound like this will be an improvement.
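
One way such a test could look is sketched below. It is not taken from the Spiral test suite: the term list is a small illustrative subset, build_sorted_re and first_match are hypothetical helpers, and the use of re.escape and an alphabetical tie-break are additions beyond the one-line fix proposed above. The idea is to build the pattern from every possible input order and check that the match never changes.

```python
import itertools
import re

# Illustrative subset; the real common_terms_with_numbers is larger.
terms = ["md5", "md5sum", "sha1", "utf8"]

def build_sorted_re(terms):
    # Longest terms first; ties are broken alphabetically so the
    # generated pattern is identical for any input order.
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    # re.escape is an extra precaution in case a term ever contains
    # a regex metacharacter (an assumption, not part of the proposal).
    return re.compile(r'(' + '|'.join(re.escape(t) for t in ordered) + ')', re.I)

def first_match(pattern, identifier):
    # Return the text matched at the start of the identifier, or None.
    m = pattern.match(identifier)
    return m.group(0) if m else None

# Try every possible input order and collect the distinct results.
results = {
    first_match(build_sorted_re(perm), "md5sum")
    for perm in itertools.permutations(terms)
}
assert results == {"md5sum"}
```

If the sort were removed from build_sorted_re, the set of results would contain both "md5" and "md5sum", which is the inconsistency reported in this issue.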