MichaelBrazier closed this issue 3 years ago
Interesting. Thanks for reporting this problem and sorry for the trouble. Should be easy to fix for the upcoming update.
The UTF-16 and UTF-32 little-endian BOM checking code block in lib/input.cpp:704 should be:
    else if (utf8_[0] == '\xff' && utf8_[1] == '\xfe') // UTF-16 or UTF-32 little endian BOM FFFEXXXX?
    {
      if (::fread(utf8_ + 2, 2, 1, file_) == 1)
      {
        size_ = 0;
        if (utf8_[2] == '\0' && utf8_[3] == '\0') // UTF-32 little endian BOM FFFE0000?
        {
          ulen_ = 0;
          utfx_ = file_encoding::utf32le;
        }
        else
        {
          int c = static_cast<unsigned char>(utf8_[2]) | static_cast<unsigned char>(utf8_[3]) << 8;
          if (c < 0x80)
          {
            uidx_ = 2;
            ulen_ = 1;
          }
          else
          {
            if (c >= 0xD800 && c < 0xE000)
            {
              // UTF-16 surrogate pair
              if (c < 0xDC00 && ::fread(utf8_, 2, 1, file_) == 1 && (static_cast<unsigned char>(utf8_[1]) & 0xFC) == 0xDC)
                c = 0x010000 - 0xDC00 + ((c - 0xD800) << 10) + (static_cast<unsigned char>(utf8_[0]) | static_cast<unsigned char>(utf8_[1]) << 8);
              else
                c = REFLEX_NONCHAR;
            }
            ulen_ = utf8(c, utf8_);
          }
          utfx_ = file_encoding::utf16le;
        }
      }
    }
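The surrogate-pair arithmetic in the block above is easy to misread, so here is a minimal standalone sketch of the same calculation. The helper name combine_surrogates and the -1 error value are made up for illustration and are not part of RE-flex (the patch itself stores REFLEX_NONCHAR for an invalid pair).

```cpp
#include <cstdint>

// Hypothetical helper, not part of RE-flex: combine a UTF-16 high surrogate
// (0xD800..0xDBFF) with a low surrogate (0xDC00..0xDFFF) into one code point.
// Returns -1 when the pair is invalid; this error value is an assumption of
// the sketch, the patch uses REFLEX_NONCHAR for that case.
inline std::int32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo)
{
  if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
    return -1;
  // Same arithmetic as the patch: 0x010000 - 0xDC00 + ((hi - 0xD800) << 10) + lo
  return 0x010000 + ((static_cast<std::int32_t>(hi) - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, combine_surrogates(0xD83D, 0xDE00) yields 0x1F600, and the largest valid pair (0xDBFF, 0xDFFF) yields 0x10FFFF.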
Fixed in v3.0.9
When opening a UTF-16 LE file with a BOM, RE-flex copies only the first byte of the character immediately following the BOM to the output. If the first character in the file has a code point above 0xFF, this drops the high-order byte and produces incorrect output.
I found this bug while building a Unicode-aware scanner and testing it with a UTF-16 file in which the first character after the BOM is another BOM. The result was a UTF-8 byte sequence that decodes to a code point well above the Unicode maximum of 0x10FFFF.
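To reproduce the test input described above, a file like this can be generated with a few lines of C++. This is only a sketch: the function name write_double_bom_utf16le is hypothetical and the path is whatever the caller chooses.

```cpp
#include <fstream>
#include <string>

// Hypothetical repro helper, not part of RE-flex: write a UTF-16 LE file
// whose first character after the BOM is U+FEFF again.
inline bool write_double_bom_utf16le(const std::string& path)
{
  static const char bytes[] = { '\xFF', '\xFE',   // UTF-16 LE BOM
                                '\xFF', '\xFE' }; // U+FEFF as the first character
  std::ofstream out(path, std::ios::binary);
  out.write(bytes, sizeof bytes);
  return static_cast<bool>(out); // false if opening or writing failed
}
```

A correct decoder should turn the second U+FEFF into the three UTF-8 bytes EF BB BF.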
For UTF-16 LE files, the BOM-detection code leaves both bytes of the first character in the utf8_ buffer along with the BOM, but the initial call to file_get() copies only the first byte. Copying both bytes would not give correct results either; the two bytes have to be read as one 16-bit code unit.
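To illustrate the point, here is a minimal sketch of reading the two bytes as one little-endian 16-bit unit and then encoding the result as UTF-8. Both helper names are hypothetical and this is not RE-flex code. The encoder only covers values up to 0xFFFF and does not treat surrogates specially.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: assemble one UTF-16 LE code unit from two bytes,
// low-order byte first.
inline std::uint16_t utf16le_unit(unsigned char b0, unsigned char b1)
{
  return static_cast<std::uint16_t>(b0 | (b1 << 8));
}

// Hypothetical helper: encode a single 16-bit value as UTF-8 (1 to 3 bytes).
// Surrogate values are not rejected here; a real decoder must pair them first.
inline std::string to_utf8_bmp(std::uint16_t c)
{
  std::string out;
  if (c < 0x80)
  {
    out += static_cast<char>(c);
  }
  else if (c < 0x800)
  {
    out += static_cast<char>(0xC0 | (c >> 6));
    out += static_cast<char>(0x80 | (c & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xE0 | (c >> 12));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
  }
  return out;
}
```

With the bytes FF FE (U+FEFF in little-endian order), utf16le_unit gives 0xFEFF and to_utf8_bmp gives EF BB BF. Copying the first byte alone would emit a lone 0xFF, which is not valid UTF-8.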