Preprocessing: replace repeating punctuation characters

chartbeat-labs / textacy

NLP, before and after spaCy

https://textacy.readthedocs.io


Preprocessing: replace repeating punctuation characters #275

Closed saippuakauppias closed 4 years ago

saippuakauppias commented 5 years ago

context

In ordinary text, people often repeat certain characters many times to decorate the text visually.

proposed solution

Replace repeated punctuation, for example:

......... -> ...
***** -> * (or remove?)
------ -> -
_______ -> _ (or remove?)
+++++ -> + (or remove?)
etc.

alternative solutions?

Perhaps some repeated characters should be removed entirely, but not all of them. (To say it up front: the remove_punctuation method is not suitable for solving this problem.)

bdewilde commented 5 years ago

Hi @saippuakauppias, I think this could be useful, thanks for the suggestion! I'll hack on it a bit to find a decent, general-purpose solution.

bdewilde commented 5 years ago

Hey, check out the above commit to see what I came up with. Since I couldn't be sure of any cases that were correct in all circumstances, I had to defer some decision-making to users. So, you'll have to call the function once for each punctuation repetition you want to normalize:

preprocessing.normalize_repeating_chars(text, chars=".", maxn=3)
preprocessing.normalize_repeating_chars(text, chars="*", maxn=1)  # or 0
preprocessing.normalize_repeating_chars(text, chars="-", maxn=1)  # or 0
...
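
To make the effect of `maxn` concrete, here is a minimal sketch using only the standard `re` module. The helper name `collapse_repeats` is hypothetical and is not textacy's own code; it only assumes the behavior described above, namely that runs longer than `maxn` are cut down to exactly `maxn` repetitions.

```python
import re


def collapse_repeats(text: str, chars: str, maxn: int = 1) -> str:
    """Hypothetical stand-in (not textacy's code): cap runs of `chars` at `maxn`."""
    # Match maxn + 1 or more consecutive repetitions of `chars`.
    pattern = r"(?:{}){{{},}}".format(re.escape(chars), maxn + 1)
    # A callable replacement means the replacement text is used literally.
    return re.sub(pattern, lambda m: chars * maxn, text)


# One call per punctuation mark, mirroring the calls shown above.
text = "Wait......... what?! ***** ------"
text = collapse_repeats(text, ".", maxn=3)
text = collapse_repeats(text, "*", maxn=1)
text = collapse_repeats(text, "-", maxn=1)
print(text)  # Wait... what?! * -

# maxn=0 removes the run entirely.
removed = collapse_repeats("a---b", "-", maxn=0)
print(removed)  # ab
```

Calling the helper once per character keeps the choice of `maxn` explicit for each punctuation mark, which matches the design decision described in the comment above.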

saippuakauppias commented 5 years ago

Awesome! It's a simple and good solution :)

I think my case is:

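import string
from textacy import preprocessing

for punct in string.punctuation.replace('.', ''):  # skip '.' so that '...' is preserved
    # keep the returned string; the function does not modify text in place
    text = preprocessing.normalize_repeating_chars(text, chars=punct, maxn=1)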

saippuakauppias commented 4 years ago

@bdewilde, there is a small error with backslash escaping:

>>> print(normalize_repeating_chars('123', chars='\\', maxn=1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/gpt2/DNTG/preprocessing/src/textacy/textacy/preprocessing/normalize.py", line 63, in normalize_repeating_chars
    return re.sub(r"({}){{{},}}".format(re.escape(chars), maxn + 1), chars * maxn, text)
  File "/usr/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.6/re.py", line 326, in _subx
    template = _compile_repl(template, pattern)
  File "/usr/lib/python3.6/re.py", line 317, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "/usr/lib/python3.6/sre_parse.py", line 879, in parse_template
    s = Tokenizer(source)
  File "/usr/lib/python3.6/sre_parse.py", line 231, in __init__
    self.__next()
  File "/usr/lib/python3.6/sre_parse.py", line 245, in __next
    self.string, len(self.string) - 1) from None
sre_constants.error: bad escape (end of pattern) at position 0
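
The traceback points at the replacement argument rather than the pattern: `re.sub` parses a replacement string as a template, so a lone backslash is read as an unfinished escape sequence. The sketch below reproduces that with the standard library only, then shows one possible workaround, passing a callable so that the replacement is taken literally. It is an illustration under those assumptions, not a description of how textacy addressed the report.

```python
import re

chars, maxn = "\\", 1  # a single backslash, as in the failing call
pattern = r"({}){{{},}}".format(re.escape(chars), maxn + 1)
sample = "a" + "\\" * 3 + "b"  # 'a', three backslashes, 'b'

# A replacement *string* is parsed as a template; a lone backslash is an
# incomplete escape sequence, so re.sub raises re.error.
try:
    re.sub(pattern, chars * maxn, sample)
    raised = False
except re.error:
    raised = True

# A replacement *callable* bypasses template parsing: its return value
# is inserted as-is.
fixed = re.sub(pattern, lambda m: chars * maxn, sample)
```

With the callable, the run of three backslashes in `sample` collapses to a single backslash and no error is raised.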

saippuakauppias commented 4 years ago

Hey @bdewilde, has the error with backslash escaping not been fixed? I think it's a bug...

bdewilde commented 4 years ago

Hi @saippuakauppias, I think this error results from how Python handles escape characters in the context of the re module. Does either of these two solutions suffice?

In [8]: print(textacy.preprocessing.normalize_repeating_chars('123', chars=r'\\', maxn=1))
123

In [9]: print(textacy.preprocessing.normalize_repeating_chars('123', chars='\\\\', maxn=1))
123
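
One caveat worth checking before adopting either call: `r'\\'` and `'\\\\'` are the same two-character string, two backslashes. The sketch below copies the expression quoted in the traceback into a local helper (the name `normalize_as_in_traceback` is hypothetical, and later textacy versions may differ) to show what that input does. Under this assumption the call no longer raises, but it treats a pair of backslashes as the repeating unit, so it only rewrites runs of four or more.

```python
import re


def normalize_as_in_traceback(text, chars, maxn):
    # Same expression as the line quoted in the traceback above.
    return re.sub(r"({}){{{},}}".format(re.escape(chars), maxn + 1), chars * maxn, text)


# Both spellings denote the same string: two backslash characters.
assert r"\\" == "\\\\" and len(r"\\") == 2

two = "a" + "\\" * 2 + "b"   # two backslashes between 'a' and 'b'
four = "a" + "\\" * 4 + "b"  # four backslashes between 'a' and 'b'

# Two backslashes are a single occurrence of the unit, so nothing changes.
unchanged = normalize_as_in_traceback(two, chars=r"\\", maxn=1)

# Four backslashes are two occurrences of the unit; the replacement template
# (two backslashes) is read as one escaped backslash, so one remains.
collapsed = normalize_as_in_traceback(four, chars=r"\\", maxn=1)
```

If the goal is to collapse any run of single backslashes, this behaves differently from what the other punctuation calls do, so it may be worth testing on real input first.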