For characters in the range 0x0000~0xFFFF, we cannot detect the byte order from the code point directly (0x4E2D and 0x2D4E are both valid, for example), so we need two dedicated functions to handle UTF-16 LE and UTF-16 BE, while the basic `utf16_decode_char` decodes UTF-16 in the platform's native byte order (LE or BE, depending on the processor).
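A minimal sketch of how such LE/BE-specific functions could be layered on top of the native-order decoder. The `utf16_decode_char` signature, the `c16` alias, and the helper names below are assumptions made for illustration, not the library's actual API:

```cpp
#include <bit>     // C++20: std::endian
#include <cstddef>

using c16 = char16_t; // assumption: c16 is a 16-bit code unit type

// Stand-in for the library's native-order decoder (the real signature may
// differ): decodes one code point and reports how many units were consumed.
char32_t utf16_decode_char(const c16* src, std::size_t* consumed)
{
    c16 u = src[0];
    if (u >= 0xD800 && u <= 0xDBFF) // high surrogate: part of a surrogate pair
    {
        *consumed = 2;
        return 0x10000u + ((char32_t(u) - 0xD800u) << 10)
                        + (char32_t(src[1]) - 0xDC00u);
    }
    *consumed = 1;
    return u;
}

// Hypothetical helper: swap the low and high byte of one code unit.
constexpr c16 swap_c16(c16 u)
{
    return static_cast<c16>(((u & 0x00FFu) << 8) | ((u & 0xFF00u) >> 8));
}

// Hypothetical LE-specific variant: on a big-endian host, swap each unit to
// native order before deferring to the native-order decoder. A BE variant
// would mirror this with the endianness test inverted.
char32_t utf16_le_decode_char(const c16* src, std::size_t* consumed)
{
    if constexpr (std::endian::native == std::endian::little)
        return utf16_decode_char(src, consumed); // already in native order
    c16 units[2];
    units[0] = swap_c16(src[0]);
    // Only read the second unit when the first is a high surrogate.
    units[1] = (units[0] >= 0xD800 && units[0] <= 0xDBFF) ? swap_c16(src[1])
                                                          : c16(0);
    return utf16_decode_char(units, consumed);
}
```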
This is actually not a bug. When the user reads byte streams from a file, she must ensure that every `c16` character in the file is read in the correct byte order. For example, if the file was saved in UTF-16 BE and is opened on a little-endian platform (like x86), the user must convert every `c16` to little endian by swapping the low 8 bits and high 8 bits before passing the `c16` string to `utf16_decode_char`. The byte order of a UTF-16 file is usually identified by the Byte Order Mark (BOM) placed at the beginning of the file: 0xFE 0xFF for BE and 0xFF 0xFE for LE.
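As an illustration of the BOM check and byte swap described above, here is a small sketch; the helper names and the enum are hypothetical, not part of the library:

```cpp
#include <cstddef>

// Hypothetical helper: inspect the first two bytes of a UTF-16 file and
// report its byte order from the BOM (0xFE 0xFF = BE, 0xFF 0xFE = LE).
enum class Utf16Order { LittleEndian, BigEndian, Unknown };

Utf16Order detect_utf16_bom(const unsigned char* bytes, std::size_t size)
{
    if (size >= 2)
    {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Utf16Order::BigEndian;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Utf16Order::LittleEndian;
    }
    return Utf16Order::Unknown; // no BOM: the caller must decide
}

// Swap every code unit in place, e.g. after loading a BE file on x86.
void swap_to_native(char16_t* units, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        units[i] = static_cast<char16_t>(((units[i] & 0x00FFu) << 8) |
                                         ((units[i] & 0xFF00u) >> 8));
}
```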
Test code:
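A sketch of what such a test might look like, reusing the hypothetical helpers from the snippets above (not the issue's original test code):

```cpp
#include <bit>
#include <cassert>
#include <cstring>

int main()
{
    // Raw file bytes: a BE BOM followed by U+4E2D in big-endian order.
    const unsigned char file[] = { 0xFE, 0xFF, 0x4E, 0x2D };

    assert(detect_utf16_bom(file, sizeof(file)) == Utf16Order::BigEndian);

    // Reinterpret the payload after the BOM as a c16 unit. On a
    // little-endian host (x86) it reads as 0x2D4E until we swap it.
    char16_t unit;
    std::memcpy(&unit, file + 2, sizeof(unit));
    if constexpr (std::endian::native == std::endian::little)
        swap_to_native(&unit, 1);

    std::size_t consumed = 0;
    assert(utf16_decode_char(&unit, &consumed) == U'\u4E2D');
    assert(consumed == 1);
    return 0;
}
```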