For characters in the range 0x0000~0xFFFF, we cannot detect the byte order from the code point directly (0x4E2D and 0x2D4E are both valid, for example), so we need two dedicated functions to handle UTF-16 LE and UTF-16 BE, while the basic `utf16_decode_char` decodes UTF-16 in the platform's native byte order (LE or BE, depending on the processor).
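A minimal sketch of how such LE/BE-specific functions could be layered on top of the native-order decoder. The `utf16_decode_char` signature, the `c16` alias, and the helper names below are assumptions made for illustration, not the library's actual API:

```cpp
#include <bit>     // C++20: std::endian
#include <cstddef>

using c16 = char16_t; // assumption: c16 is a 16-bit code unit type

// Stand-in for the library's native-order decoder (the real signature may
// differ): decodes one code point and reports how many units were consumed.
char32_t utf16_decode_char(const c16* src, std::size_t* consumed)
{
    c16 u = src[0];
    if (u >= 0xD800 && u <= 0xDBFF) // high surrogate: part of a surrogate pair
    {
        *consumed = 2;
        return 0x10000u + ((char32_t(u) - 0xD800u) << 10)
                        + (char32_t(src[1]) - 0xDC00u);
    }
    *consumed = 1;
    return u;
}

// Hypothetical helper: swap the low and high byte of one code unit.
constexpr c16 swap_c16(c16 u)
{
    return static_cast<c16>(((u & 0x00FFu) << 8) | ((u & 0xFF00u) >> 8));
}

// Hypothetical LE-specific variant: on a big-endian host, swap each unit to
// native order before deferring to the native-order decoder. A BE variant
// would mirror this with the endianness test inverted.
char32_t utf16_le_decode_char(const c16* src, std::size_t* consumed)
{
    if constexpr (std::endian::native == std::endian::little)
        return utf16_decode_char(src, consumed); // already in native order
    c16 units[2];
    units[0] = swap_c16(src[0]);
    // Only read the second unit when the first is a high surrogate.
    units[1] = (units[0] >= 0xD800 && units[0] <= 0xDBFF) ? swap_c16(src[1])
                                                          : c16(0);
    return utf16_decode_char(units, consumed);
}
```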
This is actually not a bug. When the user reads byte streams from a file, she must ensure that every `c16` character in the file is read in the correct byte order. For example, if the file was saved in UTF-16 BE and is opened on a little-endian platform (like x86), the user must convert every `c16` to little endian by swapping the low 8 bits and high 8 bits before passing the `c16` string to `utf16_decode_char`. The byte order of a UTF-16 file is usually identified by the Byte Order Mark (BOM) placed at the beginning of the file: 0xFE 0xFF for BE and 0xFF 0xFE for LE.
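As an illustration of the BOM check and byte swap described above, here is a small sketch; the helper names and the enum are hypothetical, not part of the library:

```cpp
#include <cstddef>

// Hypothetical helper: inspect the first two bytes of a UTF-16 file and
// report its byte order from the BOM (0xFE 0xFF = BE, 0xFF 0xFE = LE).
enum class Utf16Order { LittleEndian, BigEndian, Unknown };

Utf16Order detect_utf16_bom(const unsigned char* bytes, std::size_t size)
{
    if (size >= 2)
    {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Utf16Order::BigEndian;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Utf16Order::LittleEndian;
    }
    return Utf16Order::Unknown; // no BOM: the caller must decide
}

// Swap every code unit in place, e.g. after loading a BE file on x86.
void swap_to_native(char16_t* units, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        units[i] = static_cast<char16_t>(((units[i] & 0x00FFu) << 8) |
                                         ((units[i] & 0xFF00u) >> 8));
}
```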
Test code:
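A sketch of what such a test might look like, reusing the hypothetical helpers from the snippets above (not the issue's original test code):

```cpp
#include <bit>
#include <cassert>
#include <cstring>

int main()
{
    // Raw file bytes: a BE BOM followed by U+4E2D in big-endian order.
    const unsigned char file[] = { 0xFE, 0xFF, 0x4E, 0x2D };

    assert(detect_utf16_bom(file, sizeof(file)) == Utf16Order::BigEndian);

    // Reinterpret the payload after the BOM as a c16 unit. On a
    // little-endian host (x86) it reads as 0x2D4E until we swap it.
    char16_t unit;
    std::memcpy(&unit, file + 2, sizeof(unit));
    if constexpr (std::endian::native == std::endian::little)
        swap_to_native(&unit, 1);

    std::size_t consumed = 0;
    assert(utf16_decode_char(&unit, &consumed) == U'\u4E2D');
    assert(consumed == 1);
    return 0;
}
```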