Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

Resetting the lexer with FILE* skips the first character #178

Closed jmaebe closed 1 year ago

jmaebe commented 1 year ago

The issue is with this code:

  /// Reset the matcher and start scanning from the given input character sequence I.
  template<typename I>
  inline AbstractLexer& in(const I& input) ///< a character sequence to scan, e.g. reflex::Input, char*, wchar_t*, std::string, std::wstring, FILE*, std::istream
    /// @returns reference to *this
  {
(1)    in_ = input;
    if (has_matcher())
(2)      matcher().input(input); // reset and assign new input
    return *this;
  }

If you parse a FILE* fh_Input, then call rewind(fh_Input); followed by yyrestart(fh_Input);, the above code calls the Input(FILE *file) constructor twice, once at line (1) and once at line (2). Each call runs init(), which uses fread to read a byte from the file to check for a UTF* signature: the first call reads the first byte and the second call reads the second byte. As a result, the first character of the file is lost after the yyrestart: once the actual lexing starts, utf8_[0] caches the second character instead of the first.
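To make the mechanism concrete without the proprietary lexer, here is a minimal sketch of the same effect. PeekingInput is a hypothetical stand-in, not reflex::Input: the only thing it shares with the real class is that its FILE* constructor consumes one byte with fread. The helper function names are also made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for an input wrapper whose FILE* constructor peeks
// at the stream (as a signature check would). This is NOT reflex::Input.
struct PeekingInput {
  FILE *file;
  int cached;  // byte consumed by the constructor, or EOF if none was read

  explicit PeekingInput(FILE *f) : file(f), cached(EOF) {
    unsigned char c;
    if (f != nullptr && std::fread(&c, 1, 1, f) == 1)
      cached = c;
  }
};

// Writes "abc" to a temporary file and rewinds it, ready for scanning.
static FILE *make_rewound_file() {
  FILE *f = std::tmpfile();
  if (f != nullptr) {
    std::fputs("abc", f);
    std::rewind(f);
  }
  return f;
}

// Mirrors the two implicit conversions at (1) and (2): the same FILE* is
// wrapped twice, so the second wrapper caches the second byte.
int double_wrap_cached_byte() {
  FILE *f = make_rewound_file();
  if (f == nullptr) return EOF;
  PeekingInput first(f);   // consumes 'a'
  PeekingInput second(f);  // consumes 'b'
  int result = second.cached;
  std::fclose(f);
  return result;
}

// Wraps the FILE* once and copies the wrapper: copying does not touch the
// stream, so the cached byte is still the first byte of the file.
int single_wrap_cached_byte() {
  FILE *f = make_rewound_file();
  if (f == nullptr) return EOF;
  PeekingInput once(f);       // consumes 'a'
  PeekingInput copy = once;   // no additional read
  int result = copy.cached;
  std::fclose(f);
  return result;
}
```

The second function suggests one possible direction for a fix, namely converting the FILE* to an input object once and reusing that object for both assignments. Whether that matches how the library should resolve it is an assumption on my part.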

Interestingly, I've only encountered this issue under macOS and not under Windows. I did not debug it under Windows to see why it works there.

It's a proprietary lexer so I can't share the code, but it has the following options:

%option noyywrap
%option yylineno
%option unicode

and is generated with reflex --flex --bison

jmaebe commented 1 year ago

Here are the call stacks (v3.0.10).

First:

#0  0x00007fff6292e2da in __fread () from /usr/lib/system/libsystem_c.dylib
#1  0x00007fff6292e127 in fread () from /usr/lib/system/libsystem_c.dylib
#2  0x00000001002059b4 in reflex::Input::file_init (this=0x7ffeefbfe640, enc=0) at ../../lib/input.cpp:676
#3  0x00000001001b7b2f in reflex::Input::init (this=0x7ffeefbfe640, enc=0) at /Data/dev/osiris/re-flex/include/reflex/input.h:743
#4  0x00000001001b7ab7 in reflex::Input::Input (this=0x7ffeefbfe640, file=0x7fff98ed30c8) at /Data/dev/osiris/re-flex/include/reflex/input.h:439
#5  0x00000001001b797d in reflex::Input::Input (this=0x7ffeefbfe640, file=0x7fff98ed30c8) at /Data/dev/osiris/re-flex/include/reflex/input.h:438
#6  0x000000010019f0a8 in reflex::AbstractLexer<reflex::Matcher>::in<__sFILE*> (this=0x100265270 <YY_SCANNER>, input=@0x10025a0e0: 0x7fff98ed30c8) at /Data/dev/osiris/re-flex/include/reflex/abslexer.h:131

Second:

#0  0x00007fff6292e2da in __fread () from /usr/lib/system/libsystem_c.dylib
#1  0x00007fff6292e127 in fread () from /usr/lib/system/libsystem_c.dylib
#2  0x00000001002059b4 in reflex::Input::file_init (this=0x7ffeefbfe5f8, enc=0) at ../../lib/input.cpp:676
#3  0x00000001001b7b2f in reflex::Input::init (this=0x7ffeefbfe5f8, enc=0) at /Data/dev/osiris/re-flex/include/reflex/input.h:743
#4  0x00000001001b7ab7 in reflex::Input::Input (this=0x7ffeefbfe5f8, file=0x7fff98ed30c8) at /Data/dev/osiris/re-flex/include/reflex/input.h:439
#5  0x00000001001b797d in reflex::Input::Input (this=0x7ffeefbfe5f8, file=0x7fff98ed30c8) at /Data/dev/osiris/re-flex/include/reflex/input.h:438
#6  0x000000010019f115 in reflex::AbstractLexer<reflex::Matcher>::in<__sFILE*> (this=0x100265270 <YY_SCANNER>, input=@0x10025a0e0: 0x7fff98ed30c8) at /Data/dev/osiris/re-flex/include/reflex/abslexer.h:133