In UTF-16, characters from the supplementary planes are encoded as two 16-bit code units called a surrogate pair (a high surrogate in the range 0xD800 to 0xDBFF followed by a low surrogate in the range 0xDC00 to 0xDFFF).
In Python 3.6, these surrogate values are not printable on their own: str.isprintable() returns False for each of them.
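
As a quick check of that behavior in CPython, here is a minimal sketch. It uses the same four boundary values as the test case in this report; the variable names are only illustrative.

```python
import unicodedata

# Boundary values of the high (0xD800-0xDBFF) and low (0xDC00-0xDFFF) surrogate ranges.
surrogates = ['\ud800', '\udbff', '\udc00', '\udfff']

# Each lone surrogate has general category "Cs" (surrogate), so it is not printable.
results = [(hex(ord(ch)), unicodedata.category(ch), ch.isprintable()) for ch in surrogates]
for code_point, category, printable in results:
    print(code_point, category, printable)
```

Each line of output should show the category Cs and False for isprintable().
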
Currently, voc does not support surrogate pairs. I've created a test case in tests\datatypes\test_str.py:

    @expectedFailure
    def test_isprintable_surrogate_cases(self):
        self.assertCodeExecution(r"""
            tests = ['\ud800', '\udbff', '\udc00', '\udfff']
            for test in tests:
                print(test.isprintable())
            """)

The results:

    Traceback (most recent call last):
      File "...\voc\voc\python\ast.py", line 160, in visit
        super().visit(node)
      File "C:\Python36\lib\ast.py", line 253, in visit
        return visitor(node)
      File "...\voc\voc\python\ast.py", line 49, in dec
        fn(self, node, *args, **kwargs)
      File "...\voc\voc\python\ast.py", line 2081, in visit_Str
        self.context.add_str(node.s)
      File "...\voc\voc\python\blocks.py", line 151, in add_str
        python.Str(value),
      File "...\voc\voc\python\blocks.py", line 44, in add_opcodes
        if opcode.process(self):
      File "...\voc\voc\python\types\python.py", line 272, in process
        JavaOpcodes.LDC_W(self.value),
      File "...\voc\voc\java\opcodes.py", line 3615, in __init__
        self.const = String(const)
      File "...\voc\voc\java\constants.py", line 520, in __init__
        self.value = Utf8(value)
      File "...\voc\voc\java\constants.py", line 966, in __init__
        self._bytes = string.encode('mutf-8')
      File "...\voc\voc\java\mutf_8.py", line 226, in encode
        return IncrementalEncoder(errors).encode(input, final=final), len(input)
      File "...\voc\voc\java\mutf_8.py", line 151, in encode
        (result, consumed) = self._buffer_encode(data, self.errors, final)
      File "...\voc\voc\java\mutf_8.py", line 162, in _buffer_encode
        final
      File "...\voc\voc\java\mutf_8.py", line 203, in _buffer_encode_codepoint
        return codecs.utf_8_encode(input, self.errors)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Looking at the traceback, I'm guessing additional processing would be needed to check for these surrogates during transpilation to Java bytecode.
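
To make the failure and one possible direction concrete, here is a minimal sketch; it is not voc's actual code. It assumes that a lone surrogate can be written as the three-byte sequence that Java's modified UTF-8 uses for 16-bit values, which Python's built-in surrogatepass error handler happens to produce. The helper name encode_allowing_surrogates is hypothetical.

```python
def encode_allowing_surrogates(text):
    """Hypothetical helper: UTF-8 encode, letting lone surrogates through."""
    return text.encode('utf-8', errors='surrogatepass')


# The strict codec rejects the surrogate, as in the traceback above.
try:
    '\ud800'.encode('utf-8')
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc.reason

print(strict_error)

# With surrogatepass, the surrogate becomes a three-byte sequence.
encoded = encode_allowing_surrogates('\ud800')
print(encoded)

# The same handler restores the original string when decoding.
decoded = encoded.decode('utf-8', errors='surrogatepass')
print(decoded == '\ud800')
```

This only covers the lone-surrogate case from the test. A real fix in the mutf-8 codec would presumably still need to handle the other places where modified UTF-8 differs from standard UTF-8, such as the NUL character and supplementary characters.
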