Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

First character of UTF16LE files with BOM incorrectly decoded #110

MichaelBrazier closed this issue 3 years ago

MichaelBrazier commented 3 years ago

When opening a UTF-16 LE file with a BOM, RE-flex copies only the first byte of the character immediately following the BOM to the output. If the first character in the file has a code point above 0xFF this drops the high-order byte, giving incorrect results.

I found this bug while building a Unicode-aware scanner and testing it with a UTF-16 file in which the first character after the BOM is another BOM (U+FEFF). The result was a UTF-8 byte sequence that decodes to a code point well above the Unicode maximum of 0x10FFFF.

For UTF-16 LE files, the BOM-detection code leaves both bytes of the first character in the utf8_ buffer along with the BOM, but the initial call to file_get() copies only the first byte. Copying both bytes would not give correct results either; the two bytes have to be read as one 16-bit code unit and then converted to UTF-8.
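
To illustrate the point about reading the two bytes as one 16-bit value, here is a minimal standalone sketch. It is not RE-flex code: `to_utf8` and `first_char_after_bom` are hypothetical helpers invented for this example, and the sketch assumes the first character is not part of a surrogate pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of RE-flex): encode a code point
// below 0x10000 as UTF-8. Surrogates are not handled in this sketch.
std::string to_utf8(unsigned long c)
{
  std::string s;
  if (c < 0x80)
  {
    s += static_cast<char>(c);
  }
  else if (c < 0x800)
  {
    s += static_cast<char>(0xC0 | (c >> 6));
    s += static_cast<char>(0x80 | (c & 0x3F));
  }
  else
  {
    s += static_cast<char>(0xE0 | (c >> 12));
    s += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (c & 0x3F));
  }
  return s;
}

// Hypothetical helper: given a buffer that starts with the UTF-16 LE BOM
// (FF FE), combine the next two bytes into one 16-bit code unit
// (low byte first) and return its UTF-8 encoding.
std::string first_char_after_bom(const unsigned char *buf, std::size_t len)
{
  assert(len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE);
  unsigned long c = buf[2] | (static_cast<unsigned long>(buf[3]) << 8);
  return to_utf8(c);
}
```

With the input from this report (a BOM followed by a second BOM, bytes FF FE FF FE), the second code unit is U+FEFF, whose UTF-8 encoding is EF BB BF. Copying only the byte FF, or copying both FF and FE unchanged, does not produce that sequence.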

genivia-inc commented 3 years ago

Interesting. Thanks for reporting this problem and sorry for the trouble. Should be easy to fix for the upcoming update.

genivia-inc commented 3 years ago

The UTF-16 and UTF-32 little endian BOM checking code block in lib/input.cpp:704 should be:

          else if (utf8_[0] == '\xff' && utf8_[1] == '\xfe') // UTF-16 or UTF-32 little endian BOM FFFEXXXX?
          {
            if (::fread(utf8_ + 2, 2, 1, file_) == 1)
            {
              size_ = 0;
              if (utf8_[2] == '\0' && utf8_[3] == '\0') // UTF-32 little endian BOM FFFE0000?
              {
                ulen_ = 0;
                utfx_ = file_encoding::utf32le;
              }
              else
              {
                int c = static_cast<unsigned char>(utf8_[2]) | static_cast<unsigned char>(utf8_[3]) << 8;
                if (c < 0x80)
                {
                  uidx_ = 2;
                  ulen_ = 1;
                }
                else
                {
                  if (c >= 0xD800 && c < 0xE000)
                  {
                    // UTF-16 surrogate pair
                    if (c < 0xDC00 && ::fread(utf8_, 2, 1, file_) == 1 && (static_cast<unsigned char>(utf8_[1]) & 0xFC) == 0xDC)
                      c = 0x010000 - 0xDC00 + ((c - 0xD800) << 10) + (static_cast<unsigned char>(utf8_[0]) | static_cast<unsigned char>(utf8_[1]) << 8);
                    else
                      c = REFLEX_NONCHAR;
                  }
                  ulen_ = utf8(c, utf8_);
                }
                utfx_ = file_encoding::utf16le;
              }
            }
          }
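
The surrogate-pair arithmetic in the block above can be checked in isolation. The following is a minimal sketch, not library code: `combine_surrogates` is a hypothetical helper, and it returns -1 for an invalid pair as a stand-in for the `REFLEX_NONCHAR` value used above.

```cpp
#include <cassert>

// Hypothetical helper (not part of RE-flex): combine a UTF-16 high
// surrogate (0xD800..0xDBFF) and low surrogate (0xDC00..0xDFFF) into
// one code point, using the same formula as the fix above.
// Returns -1 for an invalid pair (placeholder for REFLEX_NONCHAR).
long combine_surrogates(unsigned int hi, unsigned int lo)
{
  if (hi < 0xD800 || hi >= 0xDC00 || (lo & 0xFC00) != 0xDC00)
    return -1;
  return 0x010000L - 0xDC00L
       + (static_cast<long>(hi - 0xD800) << 10)
       + static_cast<long>(lo);
}
```

For example, the pair D83D DE00 gives 0x2400 + (0x3D << 10) + 0xDE00 = 0x1F600. A low surrogate in first position, or a high surrogate followed by a non-surrogate, is rejected.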

genivia-inc commented 3 years ago

Fixed in v3.0.9