RestComm / Restcomm-Connect

The Open Source Cloud Communications Platform
http://www.restcomm.com/
GNU Affero General Public License v3.0

Double-byte encoded incoming SMS message gets corrupted #2994

Open tomngo opened 5 years ago

tomngo commented 5 years ago

Summary

Certain higher-order characters sent from a user's handset to a Restcomm-connected bot are corrupted at the Restcomm level. Not all higher-order characters exhibit this problem.

Related Tickets

There are many tickets related to double-byte messages.

Scope of Impact

Every Restcomm-connected bot that accepts arbitrary natural-language input is affected. Users outside the US will be hit harder than US users, because their messages more often contain non-ASCII characters.

There is no reliable workaround. As discussed in #2607, the recipient can distinguish reliably between different encodings only if a BOM (U+FEFF) is present. Otherwise only heuristics are possible, and in many cases the information is simply not recoverable even if the sequence of decoding errors is known.
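To illustrate why a BOM matters, here is a minimal Python sketch of BOM sniffing on the receiving side. The function name `sniff_bom` and the idea that the bot sees the raw payload bytes are assumptions for illustration only; nothing here is Restcomm's API or its actual decoding path.

```python
import codecs

# Known BOM byte sequences. UTF-32 comes first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(payload: bytes):
    """Hypothetical helper: return (encoding, payload_without_bom),
    or (None, payload) when no BOM is present."""
    for bom, name in BOMS:
        if payload.startswith(bom):
            return name, payload[len(bom):]
    return None, payload


# Without a BOM, the same two bytes are valid input to several decoders.
raw = "\u00e9".encode("utf-8")      # the UTF-8 bytes for "é"
print(sniff_bom(raw))               # (None, b'\xc3\xa9')
print(raw.decode("latin-1"))        # decodes, but to two wrong characters
print(raw.decode("utf-16-be"))      # decodes, but to one unrelated character
```

When `sniff_bom` returns `None`, the recipient is left guessing, which is the situation described above.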

Isolated to Restcomm

I've changed every variable outside of Restcomm, and the behavior is identical:

Affected Characters

Here are some characters that are affected:

Here are some characters that are not affected:

Strangely, some characters that are not affected are higher order than some that are affected.
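To compare affected and unaffected characters by code point rather than by eye, a small helper along these lines can list each non-ASCII character with its code point and encoded bytes. This is only a diagnostic sketch; the function name `describe_non_ascii` is hypothetical, and it makes no claim about which encoding Restcomm actually uses.

```python
def describe_non_ascii(text: str):
    """Hypothetical diagnostic: one row per non-ASCII character, giving
    (character, code point, UTF-8 bytes as hex, UTF-16 BE bytes as hex)."""
    rows = []
    for ch in text:
        if ord(ch) > 0x7F:
            rows.append((
                ch,
                f"U+{ord(ch):04X}",
                ch.encode("utf-8").hex(),
                ch.encode("utf-16-be").hex(),
            ))
    return rows


# The sample sentence from the Examples section, written with escapes
# so the source file's own encoding cannot interfere.
for row in describe_non_ascii("My name is Jos\u00e9 Pe\u00f1a."):
    print(row)
```

Running it over the affected and the unaffected characters side by side makes it easier to see whether the split follows code point order or something else.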

Examples

My name is José Peña.