Describe the bug
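Create a CAS with the text "Hello 😊, my name is John." and export it as UIMA CAS XMI. When the file is imported into the Java implementation of UIMA, the token offsets after 😊 are wrong.
The problem is that UIMA does not count offsets in Unicode code points, but in UTF-16 code units (the positions of Java char values). 😊 is a single code point but takes two UTF-16 code units, so every offset after it is off by one.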
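
The mismatch can be reproduced in plain Python, where string indices count code points. The following is only a minimal sketch: the helper name `codepoint_to_utf16_offset` is hypothetical and not part of any library's API, and it assumes the offsets to convert are ordinary Python string indices.

```python
def codepoint_to_utf16_offset(text: str, offset: int) -> int:
    """Convert a code point offset into a UTF-16 code unit offset.

    Hypothetical helper for illustration only. Encoding the prefix as
    UTF-16 (little endian, no BOM) yields two bytes per code unit, so
    half the byte length is the number of code units before `offset`.
    """
    return len(text[:offset].encode("utf-16-le")) // 2


text = "Hello 😊, my name is John."

# Python indexes strings by code point, so "John" starts at 20 here.
begin = text.index("John")
end = begin + len("John")

# Java (and therefore UIMA) counts UTF-16 code units; the emoji takes two.
utf16_begin = codepoint_to_utf16_offset(text, begin)
utf16_end = codepoint_to_utf16_offset(text, end)

print(begin, end)              # code point offsets
print(utf16_begin, utf16_end)  # UTF-16 code unit offsets
```

Offsets before the emoji are identical in both schemes; only offsets after a character outside the Basic Multilingual Plane shift, by one for each such character.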
Additional context
https://github.com/inception-project/inception/issues/1811
https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#_encoding_and_offsets