Mention start and length are measured in terms of UTF-16 code units, not code points or bytes

brenns10 commented 7 months ago

Hello! I'm not sure whether to call this a bug, missing documentation, or user error, but I wanted to at least share an issue I've observed relating to mention indices and string encodings.

Description of the issue

For incoming messages, mentions are provided as an array of objects with "start" and "length" fields, representing the substring to replace with the mention. However, it's unspecified what the units are for start and length. I had personally expected one of two options: (a) Unicode code points, or (b) bytes in the UTF-8 encoded string. Though, (a) seemed far more likely.

However, the answer is neither (a) nor (b)... instead, the unit seems to be UTF-16 code units (c). Unicode code points beyond the Basic Multilingual Plane (BMP), i.e. code points >= 0x10000, are represented in UTF-16 by a "surrogate pair" of two code units. Emoji are a common example of characters beyond the BMP. To illustrate this, imagine the message: hello 💩 @user. Here's a table showing each character in the string that signal-cli would return for this message, along with its indices for (a), (b), and (c).

char   code point  (a) code point #  (b) UTF-8 byte #  (c) UTF-16 unit #
h      U+0068      0                 0                 0
e      U+0065      1                 1                 1
l      U+006C      2                 2                 2
l      U+006C      3                 3                 3
o      U+006F      4                 4                 4
SPC    U+0020      5                 5                 5
💩     U+1F4A9     6                 6-9               6-7
SPC    U+0020      7                 10                8
@user  U+FFFC      8                 11-13             9

The first column is the character (or some representation of it). The second column is the unicode code point. The third column is the index of that character if you're counting by (a) unicode code points. The fourth column is the index of that character if you're counting by (b) utf-8 encoded bytes. The fifth column is the index of that character if you're counting by (c) UTF-16 code units.

The @user mention seems to be commonly represented with U+FFFC ("Object replacement character"). That code point is within the BMP, so it is represented by one UTF-16 code unit (but three bytes in UTF-8). The 💩 emoji is U+1F4A9 which is beyond the BMP, so it is represented by a surrogate pair of UTF-16 code units, and four UTF-8 bytes.
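The three index columns can be reproduced with a short Python sketch. This is only an illustration: the helper name `offsets` is made up for this example, and it assumes the mention placeholder arrives as a literal U+FFFC in the message text.

```python
# Sample message from the table; U+FFFC stands in for the @user mention.
message = "hello \U0001F4A9 \uFFFC"


def offsets(text):
    """Yield (char, code point index, UTF-8 byte offset, UTF-16 code unit offset)."""
    utf8 = utf16 = 0
    for index, char in enumerate(text):
        yield char, index, utf8, utf16
        # Advance each counter by the size of this character in that encoding.
        utf8 += len(char.encode("utf-8"))
        utf16 += len(char.encode("utf-16-le")) // 2  # two bytes per code unit


rows = list(offsets(message))
for char, cp, byte, unit in rows:
    print(f"U+{ord(char):04X}  {cp}  {byte}  {unit}")
```

The last row is the mention placeholder, and its UTF-16 offset is the value that should line up with "start" in the mentions array.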

Here's the array of mentions which signal-cli's jsonRpc gives for that string (replaced the identifiers for privacy):

[
  {
    "name": "+1XXXXXXXXXX",
    "number": "+1XXXXXXXXXX",
    "uuid": "00000000-0000-0000-0000-000000000000",
    "start": 9,
    "length": 1
  }
]

As you can see, signal-cli says that the starting index for the mention is 9 -- which can only be correct if we're counting by (c) UTF-16 code units!

The reason I find this behavior unexpected and confusing is that the actual emoji appears in the JSON UTF-8 encoded! So there's no reason for an application to expect that it should be treating these string indices as UTF-16 code units. My guess is that Java internally uses UTF-16 to represent strings, so the indices match the way you would index them in Java. They are represented this way in signald as well, so these numbers are probably coming directly from the Signal protocol/library, and not computed by signal-cli itself.

Impact

For applications written in Python, string indexing is done by code points, so correctly written client software that uses the provided indices as-is would fail. For example, the above message might result in an IndexError, since the string has only nine code points (indices 0 through 8) and there is no code point at index 9. I'd imagine this happens with other languages that internally represent strings as a sequence of Unicode code points.
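For a Python client, one possible workaround is to convert the received span before slicing. The sketch below is a hedged example, not something signal-cli provides: `utf16_span_to_codepoints` is a hypothetical helper, and it assumes the span never starts or ends in the middle of a surrogate pair (decoding would raise UnicodeDecodeError if it did).

```python
def utf16_span_to_codepoints(text, start, length):
    """Convert a (start, length) span in UTF-16 code units to code point units."""
    encoded = text.encode("utf-16-le")
    # Each UTF-16 code unit is two bytes, so unit offsets double as byte offsets.
    cp_start = len(encoded[: start * 2].decode("utf-16-le"))
    cp_end = len(encoded[: (start + length) * 2].decode("utf-16-le"))
    return cp_start, cp_end - cp_start


message = "hello \U0001F4A9 \uFFFC"
cp_start, cp_length = utf16_span_to_codepoints(message, 9, 1)
placeholder = message[cp_start : cp_start + cp_length]
```

With the values from the mentions array above (start 9, length 1), the slice should pick out the U+FFFC placeholder rather than running off the end of the string.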

This can also cause issues when sending messages that have an emoji followed by a mention... if you assumed that the indices were based on unicode code points, you'll find that the message that gets delivered will have replaced the wrong text with your mention, which will make everything look odd.
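The sending direction needs the reverse conversion. Here is a minimal sketch, assuming the client tracks the mention position as a Python (code point) index; `codepoint_span_to_utf16` is a hypothetical name, and how the resulting numbers are passed to signal-cli is left out.

```python
def codepoint_span_to_utf16(text, start, length):
    """Convert a (start, length) span in code points to UTF-16 code units."""

    def units(s):
        # Two bytes per UTF-16 code unit; "utf-16-le" avoids a byte order mark.
        return len(s.encode("utf-16-le")) // 2

    return units(text[:start]), units(text[start : start + length])


text = "hello \U0001F4A9 @user"
cp_start = text.index("@user")
utf16_start, utf16_length = codepoint_span_to_utf16(text, cp_start, len("@user"))
```

For this message the code point index of "@user" is 8, but the start to send is 9, because the emoji before it takes two UTF-16 code units.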

My use case happens to be in C (I know, I know 😛) so everything is encoded bytes, but as long as I know how the indices should be interpreted, I can handle it myself just fine.
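For byte-oriented clients, the same idea maps a UTF-16 span onto UTF-8 byte offsets. The sketch is in Python for brevity and only shows the arithmetic; `utf16_span_to_utf8_bytes` is a hypothetical helper, and it assumes the span does not split a surrogate pair.

```python
def utf16_span_to_utf8_bytes(text, start, length):
    """Convert a (start, length) span in UTF-16 code units to UTF-8 byte units."""
    encoded = text.encode("utf-16-le")
    # Slice by bytes (two per code unit), then measure each piece in UTF-8.
    prefix = encoded[: start * 2].decode("utf-16-le")
    span = encoded[start * 2 : (start + length) * 2].decode("utf-16-le")
    return len(prefix.encode("utf-8")), len(span.encode("utf-8"))


message = "hello \U0001F4A9 \uFFFC"
byte_start, byte_length = utf16_span_to_utf8_bytes(message, 9, 1)
```

This matches the table: the placeholder occupies bytes 11-13 of the UTF-8 encoded message.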

But in general, I don't know what makes the most sense... should signal-cli internally convert "start" and "length" to unicode code points? (How to do that without breaking compatibility with clients using the current representation??) Or should this just be documented somewhere and left alone? (probably the right answer).

signal-cli version info

Linux x86_64, using the regular build from the GitHub releases page:

$ bin/signal-cli --version
signal-cli 0.13.2

AsamK commented 7 months ago

Yes, those are measured in UTF-16 code units, because that's what the Signal protocol uses, which got the behavior from Android/Java. Text style ranges also behave the same way.

As there's no single obvious way to index Unicode text (code points, grapheme clusters, UTF-8 bytes, ...), I'll keep it this way. But you're right that the documentation should mention this.

brenns10 commented 7 months ago

Thanks for confirming I'm not crazy! And I do agree after writing it all out, keeping it unchanged makes sense. It's not difficult to get it right as a client so long as you know the behavior.

I put down something in the FAQ to document this and submitted #1505 to update send --mention and send --text-style documentation to mention it. Hope this helps.

AsamK / signal-cli