davidmogar / cucco

Text normalization library for Python
MIT License
203 stars 27 forks

Fix back-port to Python 2 #17

Closed benfei closed 7 years ago

benfei commented 7 years ago

This PR reimplements the replace_characters function using regular expressions (regexes) to ease backporting to Python 2. The new version also appears to be significantly faster on large texts (if I didn't mess up the test). I feel so stupid about my previous, overcomplicated implementation :-(
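
For readers following along, here is a minimal sketch of what a regex-based character replacement can look like so that it behaves the same on Python 2 and Python 3. It is not the code from this PR: the signature (text, characters, replacement) and the empty-input handling are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import re


def replace_characters(text, characters, replacement=''):
    """Replace every character of `text` found in `characters`.

    Illustrative sketch only; the signature is an assumption, not
    necessarily the one used in the library.
    """
    if not characters:
        return text
    # Build a single character class. re.escape keeps characters such as
    # ']', '^' or '-' literal inside the brackets.
    pattern = re.compile('[' + re.escape(''.join(characters)) + ']')
    # A callable replacement avoids backslash interpretation in `replacement`.
    return pattern.sub(lambda _match: replacement, text)


print(replace_characters('Who let the dog out?', '?!'))
```

A single compiled pattern scans the text once, which is one plausible reason a regex version would be faster on large inputs than a per-character loop.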

Resolves issue #16. @davidmogar

davidmogar commented 7 years ago

Hi @feinsteinben,

I'm accepting the PR as it fixes the test, but at some point we introduced a bug for Python 2:

Python 2.7.12 (default, Jul  1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from normalizr import Normalizr
>>> normalizr = Normalizr(language='en')
>>> print(normalizr.normalize(u'Who let the dog out?'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "normalizr/normalizr.py", line 85, in normalize
    for normalization, kwargs in self._parse_normalizations(normalizations or DEFAULT_NORMALIZATIONS):
ValueError: too many values to unpack
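
The traceback above does not show the root cause, so the following is only a generic sketch of how a loop of the form "for normalization, kwargs in ..." can raise this error: the iterable yields something that is not a two-item pair. The parser functions and normalization names below are hypothetical and are not taken from normalizr.

```python
def parse_buggy(normalizations):
    """Hypothetical parser that forgets to pair bare names with kwargs."""
    for item in normalizations:
        yield item  # a bare string, not a (name, kwargs) pair


def parse_fixed(normalizations):
    """Hypothetical parser that always yields (name, kwargs) pairs."""
    for item in normalizations:
        if isinstance(item, tuple):
            name, kwargs = item
        else:
            name, kwargs = item, {}
        yield name, kwargs


def collect_names(parser, normalizations):
    """Mimic the unpacking loop from the traceback."""
    names = []
    for normalization, kwargs in parser(normalizations):
        names.append(normalization)
    return names


try:
    collect_names(parse_buggy, ['remove_stop_words'])
except ValueError as error:
    # Unpacking a long string into two names fails on Python 2 and 3.
    print(error)

print(collect_names(parse_fixed,
                    ['remove_stop_words',
                     ('replace_punctuation', {'replacement': ' '})]))
```

If the real parser branches on the type of each entry, one thing worth checking is whether that type check treats str and unicode the same way on Python 2, though that is a guess rather than a confirmed diagnosis.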