Closed saippuakauppias closed 4 years ago
Hi @saippuakauppias, this seems like an easy task for regular expressions. Does re.sub(r"\s+", " ", text, flags=re.UNICODE) work for you?
Not all of these symbols are replaced by this regular expression :)
import re
test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='
print(re.sub(r"\s+", "+", test, flags=re.UNICODE))
=+=+=+=+=+=+=+=+=+=+=+===+==
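Since the characters that slip through are invisible in the output above, a small check like the following can list them explicitly. This is only a sketch: the helper name `unmatched` is made up for illustration, and it relies on the standard `unicodedata` module for names and general categories.

```python
import re
import unicodedata

test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='

def unmatched(text):
    # Hypothetical helper: collect every character other than "=" that the
    # whitespace class does not match, with its code point, name and category.
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
        if ch != "=" and not re.fullmatch(r"\s", ch)
    ]

for row in unmatched(test):
    print(*row)
# U+200B ZERO WIDTH SPACE Cf
# U+2060 WORD JOINER Cf
# U+FEFF ZERO WIDTH NO-BREAK SPACE Cf
```

All three are in the general category Cf (format characters) rather than Zs (space separators), which is why \s skips them.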
Hi @saippuakauppias, sorry about the belated reply. It looks like three of those code points don't match r"\s+": \u200B, which is a zero-width space; \u2060, which is a word joiner; and \uFEFF, which is a zero-width no-break space. I'm pretty confident that the zero-width spaces should not be replaced by a single space, and according to Wikipedia:
U+2060 WORD JOINER (HTML · WJ): encoded in Unicode since version 3.2. The word-joiner does not produce any space, and prohibits a line break at its position
So, for purposes of "normalizing whitespace", I think the thing to do here is to replace each of these three code points by an empty string, i.e.
re.sub(r"[\u200B\u2060\uFEFF]", "", text)
Does that seem reasonable to you?
Yes, I think it's a good solution for this case :)
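Putting the two steps from this thread together, a minimal sketch might look like the following. The function name `normalize_whitespace` and the constant names are made up for illustration and are not part of any library's API.

```python
import re

# Zero-width code points discussed above: ZERO WIDTH SPACE, WORD JOINER,
# ZERO WIDTH NO-BREAK SPACE. None of them is matched by \s.
ZERO_WIDTH = re.compile(r"[\u200B\u2060\uFEFF]")
WHITESPACE = re.compile(r"\s+")

def normalize_whitespace(text):
    # Hypothetical helper: drop the zero-width characters first, then
    # collapse every run of Unicode whitespace into one ASCII space.
    text = ZERO_WIDTH.sub("", text)
    return WHITESPACE.sub(" ", text)

print(normalize_whitespace("a\u00A0\u2003b"))            # a b
print(normalize_whitespace("a\u200Bb\u2060c\uFEFFd"))    # abcd
```

Removing the zero-width characters before collapsing whitespace matters for input such as "a \u200B b": done in the other order, the two spaces on either side of the zero-width space would survive as a double space.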
context
Unicode contains many space symbols: https://www.htmlsymbols.xyz/punctuation-symbols/space-symbols
proposed solution
All space symbols need to be converted to one form (by default, the ASCII space)