Tokeniser incorrectly handles 4 byte Unicode characters

The text segmentation technique polyglot applies in tokenize/base.py does not function correctly for 4-byte Unicode characters, such as these;

>>> from polyglot.text import Text

>>> text = Text("Hello this is a test. 👎. Hello this is a test.")
>>> " ".join(text.tokens)
'Hello this is a test . 👎. H ello t his i s a t est.'

>>> text = Text("Hello this is a test. 𠜎. Hello this is a test.")
>>> " ".join(text.tokens)
'Hello this is a test . 𠜎. H ello t his i s a t est.'

>>> text = Text("Hello this is a test. 👎👎. Hello this is a test.")
>>> " ".join(text.tokens)
'Hello this is a test . 👎👎 . H e llo t h is i s a t e st.'

As you can see, polyglot starts incorrectly placing tokenisation boundaries after a 4-byte character is used. This is because ICU (and therefore pyICU) actually does not segment by Unicode code points, but rather by 2-byte unicode code units. From the docs;

In ICU, a Unicode string consists of 16-bit Unicode code units. A Unicode character may be stored with either one code unit (the most common case) or with a matched pair of special code units ("surrogates"). The data type for code units is char16_t. For single-character handling, a Unicode character code point is a value in the range 0..0x10ffff. ICU uses the UChar32 type for code points. Indexes and offsets into and lengths of strings always count code units, not code points

This means ICU returns character boundaries that count 4-byte characters as two characters, thus a naive segmentation approach with Python strings (which count them as one character) causes boundary drift.

One possible fix is to encode the text as UTF-16LE and perform the manipulations on the UTF-16LE encoded bytestrings. Like this example;

def segment_text(language_code, text):
    locale = Locale(language_code)
    boundary = BreakIterator.createWordInstance(locale)
    boundary.setText(text)
    start = boundary.first()
    words = []
    for end in boundary:
        s = boundary.getText().getText().encode('utf-16le')
        word = s[start*2:end*2].decode('utf-16le').strip()
        if word:
            words.append(word)
        start = end
    return " ".join(words)

However this is quite hacky.

aboSamoor / polyglot

Tokeniser incorrectly handles 4 byte Unicode characters #118