Closed: saippuakauppias closed this issue 4 years ago
Hi @saippuakauppias, I think this could be useful, thanks for the suggestion! I'll hack on it a bit to find a decent, general-purpose solution.
Hey, check out the above commit to see what I came up with. Since I couldn't identify any defaults that would be correct in all circumstances, I had to defer some decision-making to users. So, you'll have to call the function once for each repeated punctuation character you want to normalize:
preprocessing.normalize_repeating_chars(text, chars=".", maxn=3)
preprocessing.normalize_repeating_chars(text, chars="*", maxn=1) # or 0
preprocessing.normalize_repeating_chars(text, chars="-", maxn=1) # or 0
...
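To make the effect of chars and maxn concrete, here is a minimal stand-in, assuming the function is essentially the one-line re.sub expression that appears in the traceback later in this thread. The name normalize_repeating_chars_sketch is hypothetical and is not part of textacy.

```python
import re


def normalize_repeating_chars_sketch(text: str, *, chars: str, maxn: int = 1) -> str:
    """Collapse runs of more than `maxn` repetitions of `chars` down to `maxn`.

    Hypothetical stand-in that mirrors the expression shown in the traceback
    in this thread; it is not the library function itself.
    """
    pattern = r"({}){{{},}}".format(re.escape(chars), maxn + 1)
    return re.sub(pattern, chars * maxn, text)


dots = normalize_repeating_chars_sketch("Wait.........", chars=".", maxn=3)
stars = normalize_repeating_chars_sketch("*****", chars="*", maxn=1)
no_stars = normalize_repeating_chars_sketch("a*****b*c", chars="*", maxn=0)
```

Note that under this assumption maxn=0 removes every occurrence of the character, including single ones, because the pattern then matches runs of length one or more.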
Awesome! It's a simple and good solution :)
I think my case is:
import string
for punct in string.punctuation.replace('.', ''):  # skip '.' so that '...' is preserved
    # strings are immutable, so keep the returned value
    text = preprocessing.normalize_repeating_chars(text, chars=punct, maxn=1)
@bdewilde, there's a small error with backslash escaping:
>>> print(normalize_repeating_chars('123', chars='\\', maxn=1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/gpt2/DNTG/preprocessing/src/textacy/textacy/preprocessing/normalize.py", line 63, in normalize_repeating_chars
return re.sub(r"({}){{{},}}".format(re.escape(chars), maxn + 1), chars * maxn, text)
File "/usr/lib/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python3.6/re.py", line 326, in _subx
template = _compile_repl(template, pattern)
File "/usr/lib/python3.6/re.py", line 317, in _compile_repl
return sre_parse.parse_template(repl, pattern)
File "/usr/lib/python3.6/sre_parse.py", line 879, in parse_template
s = Tokenizer(source)
File "/usr/lib/python3.6/sre_parse.py", line 231, in __init__
self.__next()
File "/usr/lib/python3.6/sre_parse.py", line 245, in __next
self.string, len(self.string) - 1) from None
sre_constants.error: bad escape (end of pattern) at position 0
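The traceback points at the replacement argument rather than the pattern: chars * maxn is a lone backslash, and re.sub parses a replacement string as a template in which a backslash must start an escape sequence. One possible way around this, sketched below under the assumption that the function body is the expression shown in the traceback, is to pass a callable as the replacement, since its return value is inserted literally. The name normalize_repeating_chars_safe is hypothetical.

```python
import re


def normalize_repeating_chars_safe(text: str, *, chars: str, maxn: int = 1) -> str:
    """Variant of the expression in the traceback that tolerates backslashes.

    Hypothetical sketch, not the library's code. The replacement is supplied
    through a callable, so re.sub does not interpret it as a template.
    """
    pattern = r"(?:{}){{{},}}".format(re.escape(chars), maxn + 1)
    replacement = chars * maxn
    return re.sub(pattern, lambda match: replacement, text)


plain = normalize_repeating_chars_safe("123", chars="\\", maxn=1)
# Three consecutive backslashes between "a" and "b" collapse to one.
collapsed = normalize_repeating_chars_safe("a" + "\\" * 3 + "b", chars="\\", maxn=1)
```

An alternative with the same effect would be to double the backslashes in the replacement string before handing it to re.sub; the callable just avoids having to think about template escaping at all.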
Hey @bdewilde, you haven't fixed the error with backslash escaping? It's a bug, I think...
Hi @saippuakauppias, I think this error results from how Python handles escape characters in the context of re. Does either of these two solutions suffice?
In [8]: print(textacy.preprocessing.normalize_repeating_chars('123', chars=r'\\', maxn=1))
123
In [9]: print(textacy.preprocessing.normalize_repeating_chars('123', chars='\\\\', maxn=1))
123
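One thing worth checking with these two calls: both r'\\' and '\\\\' are two-character strings, so chars is a pair of backslashes rather than a single one. The sketch below uses a hypothetical stand-in that assumes the function body is the expression from the traceback above, and suggests that the pattern then only fires on runs of four or more backslashes.

```python
import re


def normalize_repeating_chars_sketch(text: str, *, chars: str, maxn: int = 1) -> str:
    # Hypothetical stand-in mirroring the expression shown in the traceback.
    pattern = r"({}){{{},}}".format(re.escape(chars), maxn + 1)
    return re.sub(pattern, chars * maxn, text)


two_backslashes = r"\\"  # the same two-character string as '\\\\'
three = "a" + "\\" * 3 + "b"
four = "a" + "\\" * 4 + "b"

# Only one complete pair fits into three backslashes, so nothing matches.
unchanged = normalize_repeating_chars_sketch(three, chars=two_backslashes, maxn=1)
# Two pairs match; the replacement template reduces them to one backslash.
collapsed = normalize_repeating_chars_sketch(four, chars=two_backslashes, maxn=1)
```

If that assumption holds, the workaround avoids the exception but does not collapse runs of two or three backslashes.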
context
In ordinary text, people often repeat certain characters many times to decorate the text visually.
proposed solution
Replace repeated punctuation, for example:
......... -> ...
***** -> * (or remove?)
------ -> -
_______ -> _ (or remove?)
+++++ -> + (or remove?)
etc.
alternative solutions?
Perhaps some repeated characters should be removed completely, but not all of them. (To be clear up front: the remove_punctuation method is not suitable for solving this problem.)
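A rough sketch of what such a per-character policy could look like, where some runs are shortened and others are dropped while single occurrences are left alone. The function name collapse_runs, the RUN_POLICY table, and its thresholds are all hypothetical choices for illustration, not an existing textacy API.

```python
import re

# Hypothetical policy: character -> (minimum run length to act on, replacement).
# An empty replacement removes the run entirely.
RUN_POLICY = {
    ".": (4, "..."),
    "*": (2, ""),
    "-": (2, "-"),
    "_": (2, ""),
    "+": (2, ""),
}


def collapse_runs(text: str, policy: dict = RUN_POLICY) -> str:
    """Replace decorative runs of punctuation according to `policy`."""
    for char, (min_run, replacement) in policy.items():
        pattern = r"(?:{}){{{},}}".format(re.escape(char), min_run)
        # A callable replacement is inserted literally, with no template parsing.
        text = re.sub(pattern, lambda match, r=replacement: r, text)
    return text


cleaned = collapse_runs("Hello......... ***** a-b ------ 1+1")
```

Single characters such as the hyphen in "a-b" or the plus in "1+1" fall below the run threshold and are kept, which is the difference from stripping punctuation wholesale.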