dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Importing XMI with smileys creates offsets that are not UIMA conform #135

Closed jcklie closed 4 years ago

jcklie commented 4 years ago

Describe the bug
Create a CAS with the text "Hello 😊, my name is John." and export it as UIMA CAS XMI. When that file is imported with the Java implementation of UIMA, the offsets of all tokens after 😊 are wrong.

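The problem is that UIMA does not count offsets in Unicode code points, as Python does, but in UTF-16 code units (Java chars). 😊 lies outside the Basic Multilingual Plane, so it is a single code point in Python but a surrogate pair of two code units in Java, and every offset after it is off by one.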
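
To make the mismatch concrete, here is a minimal sketch of converting between Python offsets (code points) and UIMA offsets (UTF-16 code units). The helper names are hypothetical and only illustrate the idea; this is not necessarily how cassis or #136 implements the conversion.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    # utf-16-le writes no byte order mark, so every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


def codepoint_to_utf16_offset(text: str, offset: int) -> int:
    """Convert a Python (code point) offset into a UTF-16 code unit offset."""
    return utf16_length(text[:offset])


def utf16_to_codepoint_offset(text: str, offset: int) -> int:
    """Convert a UTF-16 code unit offset into a Python (code point) offset.

    Assumes the offset does not fall inside a surrogate pair;
    otherwise decoding raises UnicodeDecodeError.
    """
    encoded = text.encode("utf-16-le")
    return len(encoded[: offset * 2].decode("utf-16-le"))


text = "Hello 😊, my name is John."

# Offsets of the token "John" as Python sees them (code points).
begin = text.index("John")
end = begin + len("John")

# The same token in the units UIMA expects (UTF-16 code units).
uima_begin = codepoint_to_utf16_offset(text, begin)
uima_end = codepoint_to_utf16_offset(text, end)

print(begin, end)            # 20 24
print(uima_begin, uima_end)  # 21 25
```

Tokens before the emoji have identical offsets in both schemes; only offsets at or after the end of 😊 shift, by one code unit for each character outside the Basic Multilingual Plane that precedes them.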

Additional context
https://github.com/inception-project/inception/issues/1811
https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#_encoding_and_offsets

jcklie commented 4 years ago

Should hopefully be solved with #136