Support for Unicode surrogate pairs

BPYap commented 6 years ago

In UTF-16, characters from the Supplementary Planes are encoded as two 16-bit code units called a surrogate pair (a high surrogate in the range 0xD800 to 0xDBFF, followed by a low surrogate in the range 0xDC00 to 0xDFFF).
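For reference, here is a minimal sketch of how a supplementary-plane code point maps to its surrogate pair. The helper name `to_surrogate_pair` is made up for illustration and is not part of voc.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only; not part of voc.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```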

In Python 3.6, lone surrogates are not printable. For example:

    >>> '\ud800'.isprintable()
    False
    >>> '\udc00'.isprintable()
    False

Currently, voc does not support surrogate pairs. I've created a test case in tests\datatypes\test_str.py:

    @expectedFailure
    def test_isprintable_surrogate_cases(self):
        self.assertCodeExecution(r"""
        tests = ['\ud800', '\udbff', '\udc00', '\udfff']
        for test in tests:
            print(test.isprintable())
        """)

The results:

    Traceback (most recent call last):
      File "...\voc\voc\python\ast.py", line 160, in visit
        super().visit(node)
      File "C:\Python36\lib\ast.py", line 253, in visit
        return visitor(node)
      File "...\voc\voc\python\ast.py", line 49, in dec
        fn(self, node, *args, **kwargs)
      File "...\voc\voc\python\ast.py", line 2081, in visit_Str
        self.context.add_str(node.s)
      File "...\voc\voc\python\blocks.py", line 151, in add_str
        python.Str(value),
      File "...\voc\voc\python\blocks.py", line 44, in add_opcodes
        if opcode.process(self):
      File "...\voc\voc\python\types\python.py", line 272, in process
        JavaOpcodes.LDC_W(self.value),
      File "...\voc\voc\java\opcodes.py", line 3615, in __init__
        self.const = String(const)
      File "...\voc\voc\java\constants.py", line 520, in __init__
        self.value = Utf8(value)
      File "...\voc\voc\java\constants.py", line 966, in __init__
        self._bytes = string.encode('mutf-8')
      File "...\voc\voc\java\mutf_8.py", line 226, in encode
        return IncrementalEncoder(errors).encode(input, final=final), len(input)
      File "...\voc\voc\java\mutf_8.py", line 151, in encode
        (result, consumed) = self._buffer_encode(data, self.errors, final)
      File "...\voc\voc\java\mutf_8.py", line 162, in _buffer_encode
        final
      File "...\voc\voc\java\mutf_8.py", line 203, in _buffer_encode_codepoint
        return codecs.utf_8_encode(input, self.errors)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Looking at the traceback, I'm guessing additional processing would be needed to check for these surrogates during transpilation to Java bytecode.
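As a rough idea of what that processing might look like, here is a minimal sketch of a Modified UTF-8 encoder that tolerates surrogates by using Python's built-in `surrogatepass` error handler. The function name `encode_mutf8_sketch` is hypothetical, and this is not voc's actual encoder; it only illustrates the approach, assuming the usual Modified UTF-8 rules (NUL as `0xC0 0x80`, supplementary characters as two 3-byte surrogates).

```python
def encode_mutf8_sketch(text):
    """Encode text as Modified UTF-8, allowing lone surrogates.

    Hypothetical illustration only; not voc's actual implementation.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Modified UTF-8 encodes NUL as an overlong two-byte sequence.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Supplementary characters become a surrogate pair,
            # with each surrogate encoded as three bytes.
            offset = cp - 0x10000
            for unit in (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # 'surrogatepass' lets lone surrogates through as three bytes
            # instead of raising UnicodeEncodeError.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


print(encode_mutf8_sketch("\ud800"))  # b'\xed\xa0\x80'
```

With the default `strict` error handler, the same `'\ud800'` input raises the `UnicodeEncodeError` shown in the traceback.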

jonkiparsky commented 6 years ago

@BPYap I think that https://github.com/pybee/voc/pull/811 resolves this. Can you confirm?

BPYap commented 6 years ago

Hi @jonkiparsky, I've confirmed the issue is fixed. Closing this issue now. Thanks!

beeware / voc

Support for Unicode surrogate pairs #757