chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Convert all space symbols to one form #278

Closed saippuakauppias closed 4 years ago

saippuakauppias commented 5 years ago

context

Unicode contains many space symbols: https://www.htmlsymbols.xyz/punctuation-symbols/space-symbols

proposed solution

All space symbols should be converted to one form (by default, the ASCII space).

bdewilde commented 5 years ago

Hi @saippuakauppias, this seems like an easy task for regular expressions. Does re.sub(r"\s+", " ", text, flags=re.UNICODE) work for you?

saippuakauppias commented 5 years ago

Not all of these symbols are replaced by this regular expression :)

import re
test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='
print(re.sub(r"\s+", "+", test, flags=re.UNICODE))

=+=+=+=+=+=+=+=+=+=+=+=​=⁠=+==
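Since several of the leftover characters are invisible in the output above, it can help to list the code points explicitly. The following is a minimal sketch using only the standard library; the helper names `describe` and `unmatched` are hypothetical and are not part of textacy.

```python
import re
import unicodedata

# The space-like code points from the test string above.
CANDIDATES = (
    "\u00A0\u2000\u2001\u2002\u2003\u2004\u2005\u2007"
    "\u2008\u2009\u200A\u200B\u2060\u3000\uFEFF"
)


def describe(ch):
    """Return (code point, Unicode general category, whether r"\\s" matches)."""
    matches = re.fullmatch(r"\s", ch) is not None
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), matches)


def unmatched(chars):
    """Return the code points in `chars` that r"\\s" does not match."""
    return [f"U+{ord(ch):04X}" for ch in chars if not re.fullmatch(r"\s", ch)]


for ch in CANDIDATES:
    print(*describe(ch))

print(unmatched(CANDIDATES))
```

The code points that come back unmatched belong to the general category Cf (format characters) rather than Zs (space separators), which is why `\s` skips them.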

bdewilde commented 4 years ago

Hi @saippuakauppias, sorry about the belated reply. It looks like three of those code points don't match r"\s+": \u200B, which is a zero-width space; \u2060, which is a word joiner (a zero-width, non-breaking character); and \uFEFF, which is a zero-width no-break space. I'm pretty confident that the zero-width spaces should not be replaced by a single space, and according to Wikipedia:

U+2060 WORD JOINER (HTML &#8288; · WJ): encoded in Unicode since version 3.2. The word-joiner does not produce any space, and prohibits a line break at its position

So, for the purposes of "normalizing whitespace", I think the thing to do here is to replace each of these three code points with an empty string, i.e.

re.sub(r"[\u200B\u2060\uFEFF]", "", text)

Does that seem reasonable to you?
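For illustration, the two substitutions discussed in this thread could be combined roughly as follows. This is only a sketch: the function name `normalize_spaces` and the pattern names are hypothetical and do not reflect textacy's actual API, and the set of zero-width code points is limited to the three named above.

```python
import re

# Zero-width code points named in this thread that r"\s" does not match.
RE_ZERO_WIDTH = re.compile(r"[\u200B\u2060\uFEFF]")
# Runs of anything Python's re module treats as whitespace.
RE_WHITESPACE = re.compile(r"\s+")


def normalize_spaces(text: str) -> str:
    """Hypothetical helper: drop zero-width characters, then collapse whitespace."""
    # Remove zero-width characters first so that whitespace on either
    # side of one collapses into a single space in the next step.
    text = RE_ZERO_WIDTH.sub("", text)
    return RE_WHITESPACE.sub(" ", text)


sample = "a\u00A0b\u2003\u2009c\u200Bd\u2060e\u3000f\uFEFFg"
print(normalize_spaces(sample))  # a b cde fg
```

Note that `\s+` also matches tabs and newlines, so this sketch flattens line breaks as well; a narrower character class would be needed to preserve them.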

saippuakauppias commented 4 years ago

Yes, I think it's a good solution for this case :)