Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

Save and Restore Scanner State #132

Closed ivanpedruzzi closed 2 years ago

ivanpedruzzi commented 2 years ago

Hello, I love Re/flex so far and the support for Unicode is FANTASTIC!

I was wondering if you have a suggestion for how to save the scanner state and position and how to restart the scanner at specific position and state. Any hint will be greatly appreciated.

Does the scanner need to consume the complete input, or can it operate on demand, processing chunks? For example, if I provide an istream implementation with GBs of data, can the scanner lazy-load without pulling the entire data into memory?

Thank you -Ivan

genivia-inc commented 2 years ago

Thank you for your kind comments.

> I was wondering if you have a suggestion for how to save the scanner state and position and how to restart the scanner at specific position and state.

Saving the scanner state with push_state() (Flex yy_push_state()) and restoring with pop_state() (Flex yy_pop_state()) does not save and reposition the location of the input to scan. Scanning will simply continue.

Saving the scanner buffer (input state) with push_matcher() (Flex yypush_buffer_state()) can be used to temporarily switch to a different input source to scan, then restore the original input with pop_matcher(). You could use this to scan the same file from the current location, but you will have to open the current file as a new file and move the cursor to the location with seek(), scan the input, and close it before pop_matcher().
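The repositioning step described above is ordinary stream seeking. The following is a minimal sketch using only the C++ standard library, not RE/flex API: `ScanBookmark`, `save_position`, `restore_position` and `next_word` are hypothetical names, and reading a whitespace-delimited word stands in for scanning a token.

```cpp
#include <istream>
#include <string>

// Hypothetical bookmark: where the input cursor was and which start
// condition (an int, as in Flex) the scanner was in at that point.
struct ScanBookmark {
  std::streampos pos;
  int state;
};

// Record the current read position of a seekable stream.
ScanBookmark save_position(std::istream& in, int state) {
  return ScanBookmark{in.tellg(), state};
}

// Return to a bookmark. clear() resets eofbit/failbit first, because
// seekg() cannot reposition a stream whose failbit is set.
bool restore_position(std::istream& in, const ScanBookmark& mark) {
  in.clear();
  in.seekg(mark.pos);
  return in.good();
}

// Read the next whitespace-delimited word, standing in for a token.
std::string next_word(std::istream& in) {
  std::string word;
  in >> word;
  return word;
}
```

Because a real scanner reads ahead in chunks, the stream's own `tellg()` would normally be past the last matched token. In practice the offset to save would have to come from the scanner's count of consumed bytes, which this sketch does not model.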

> Does the scanner need to consume the complete input, or can it operate on demand, processing chunks?

Normally the scanner reads input in 64K chunks and does not keep the whole file in memory. The exceptions are in(str), which scans a string str, and the more efficient buffer(buf, n+1) (Flex yy_scan_buffer(buf, n+1)), which scans buf as input with n bytes (1 or 2 extra bytes are needed in buf to store a terminating \0).

ivanpedruzzi commented 2 years ago

Robert, I am failing to replicate what I had working with Flex. In my original implementation I simply overrode the scanner's LexerInput to wire in my own data stream.

With reflex even if I override LexerInput the Matcher checks its own object “Input”.

yylex calls Matcher.buffer(). This is the basic loop in method "buffer", where method "get" calls LexerInput:

    while (in.good())
    {
      (void)grow();
      end_ += get(buf_ + end_, max_ - end_);
    }

Should the while loop simply check whether the number of bytes returned from "get" is less than what was requested?

My second attempt was to create my own implementation of reflex::Input, but the Matcher uses an instance instead of a pointer, so it seems there is no way to make the Matcher use an Input subclass.

Clearly, I am missing something obvious.

genivia-inc commented 2 years ago

The get() method can return fewer than max_ - end_ bytes. But zero returned means EOF. I see that it's not checking that, so we need to change this to:

    while (in.good()) // there is more to get while good(), e.g. via wrap()
    {
      (void)grow();
      size_t len = get(buf_ + end_, max_ - end_);
      if (len == 0)
        break;
      end_ += len;
    }

Does that help?
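The contract behind this fix (short reads are fine, a zero return means end of input) can be exercised in isolation. Below is a self-contained sketch under stated assumptions: `GetFn`, `fill_buffer` and `make_source` are hypothetical names, `std::vector` doubling stands in for grow(), and the callback stands in for an overridden LexerInput.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Stand-in for LexerInput(): copies up to n bytes into s and returns the
// count, which may be less than n; returns 0 only at end of input.
using GetFn = std::function<std::size_t(char*, std::size_t)>;

// Fill a growing buffer until get() reports end of input by returning 0.
std::string fill_buffer(const GetFn& get, std::size_t initial_capacity) {
  std::vector<char> buf(std::max<std::size_t>(initial_capacity, 1));
  std::size_t end = 0;
  for (;;) {
    if (end == buf.size())
      buf.resize(buf.size() * 2);  // grow when the buffer is full
    std::size_t len = get(buf.data() + end, buf.size() - end);
    if (len == 0)
      break;                       // zero bytes means EOF, stop asking
    end += len;
  }
  return std::string(buf.data(), end);
}

// A source that hands out at most `piece` bytes per call and never
// writes a terminating \0; the returned length is all the caller gets.
GetFn make_source(std::string data, std::size_t piece) {
  std::size_t pos = 0;
  return [data, pos, piece](char* s, std::size_t n) mutable -> std::size_t {
    std::size_t len = std::min({n, piece, data.size() - pos});
    std::memcpy(s, data.data() + pos, len);
    pos += len;
    return len;
  };
}
```

For example, `fill_buffer(make_source("hello, world", 5), 4)` collects the whole string even though every call returns at most four bytes.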

ivanpedruzzi commented 2 years ago

In addition to your astute suggestion, I had to convert from unicode to UTF-8, before copying data into the buffer.

    size_t CScannerBase::LexerInput(char* buf, size_t char_max_size)
    {
      m_wchar_buffer = ...
      reflex::Input reflex_input(m_wchar_buffer.c_str());
      ret = reflex_input.get(buf, char_max_size);
      ret = ret > char_max_size ? char_max_size : ret; //"get" may return length > then the size passed :-/
      buf[ret] = 0;
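For readers unfamiliar with the conversion step, this is roughly what encoding to UTF-8 involves. The sketch is a hand-written encoder over UTF-32 code points, not the reflex::Input code path used above; `append_utf8` and `to_utf8` are hypothetical names. On platforms where wchar_t is 16 bits, surrogate pairs would have to be combined into code points first, which is not shown.

```cpp
#include <string>

// Append the UTF-8 encoding of one Unicode code point to out.
// Surrogates and values above U+10FFFF are replaced by U+FFFD.
void append_utf8(char32_t cp, std::string& out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    cp = 0xFFFD;
  if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Encode a whole UTF-32 string as UTF-8.
std::string to_utf8(const std::u32string& text) {
  std::string out;
  for (char32_t cp : text)
    append_utf8(cp, out);
  return out;
}
```

One code point can expand to as many as four bytes, so the UTF-8 length of a buffer is not known from its character count alone.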

Why rely on a null terminator? The lexer could use the length returned.

Next, I am working on how to restore the Matcher's state and resume from a given offset.

genivia-inc commented 2 years ago

> Why rely on a null terminator? The lexer could use the length returned.

The size returned is used by the scanner. The LexerInput is called until it returns 0 as per documentation:

The LexerInput method may be invoked multiple times by the matcher engine and should eventually return zero to indicate the end of input is reached (e.g. when at EOF).

The piece-by-piece returned string in the buffer buf does not need to have a final terminating \0.

I'm pretty sure this is the way Flex works, which RE/flex replicates.

ivanpedruzzi commented 2 years ago

Robert is there a switch in the code generator that could change the following initial lines in the yylex function from local variables to virtual functions or even static data members, so they can be used in other methods?

    int CJSONScanner::yylex(void)
    {
      static const char * REGEX_INITIAL = "....";
      static const reflex::Pattern PATTERN_INITIAL(REGEX_INITIAL);
      ....

I would like to initialize the matcher in the constructor like the following, but I do not want to copy the code every time REGEX_INITIAL is regenerated. Notice that I want the matcher to buffer at most 1024 characters.

    CJSONScanner::CJSONScanner(IEditBlockBuffer* pSmartBuf, bool bCacheTokens /*= true*/)
      : CScannerBase(pSmartBuf)
    {
      static const char * REGEX_INITIAL = "..";
      static const reflex::Pattern PATTERN_INITIAL(REGEX_INITIAL);
      matcher(new Matcher(PATTERN_INITIAL, stdinit(), this));
      matcher().buffer(1024);
      YY_USER_INIT

genivia-inc commented 2 years ago

Perhaps static data members would be best to be able to share their values. There is currently no option switch to do this in reflex. IMO moving the REGEX_INITIAL and PATTERN_INITIAL to become static members is nice. I will have to check if that always works without affecting something.

Also, some way to insert code at YY_USER_INIT to initialize the matcher would be nice, e.g. to set the buffer size.

I'll take a look.

genivia-inc commented 2 years ago

I propose to add the following directive %begin to inject code into the scanner before scanning starts:

%begin{
  BEGIN
}

This outputs BEGIN code right after the matcher is created and before scanning starts:

        if (!has_matcher())
        {
          matcher(new Matcher(PATTERN_INITIAL, stdinit(), this));
          YY_USER_INIT // inserted with flex option
          BEGIN
        }

PS. I don't think that creating static members for REGEX_INITIAL and PATTERN_INITIAL is needed with this approach. If need be, these can be passed to some method calls specified in BEGIN.
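The pattern behind this proposal is lazy creation followed by a one-time user hook. Here is a toy model of it in plain C++, with no RE/flex types involved: `ToyMatcher`, `ToyLexer`, `on_begin` and `SmallBufferLexer` are hypothetical names, and the 65536 default is only a placeholder for the toy.

```cpp
#include <cstddef>
#include <memory>

// Toy stand-in for a matcher; it only models a configurable buffer size.
struct ToyMatcher {
  std::size_t buffer_size = 65536;
  void buffer(std::size_t n) { buffer_size = n; }
};

// Toy lexer that creates its matcher lazily on the first lex() call and
// then runs a user hook once, the role the proposed %begin code plays.
class ToyLexer {
 public:
  virtual ~ToyLexer() = default;

  bool has_matcher() const { return matcher_ != nullptr; }
  ToyMatcher& matcher() { return *matcher_; }
  int begin_calls() const { return begin_calls_; }

  int lex() {
    if (!has_matcher()) {
      matcher_.reset(new ToyMatcher());
      ++begin_calls_;
      on_begin();  // runs once, right after the matcher exists
    }
    return 0;      // a real lexer would scan and return a token here
  }

 protected:
  virtual void on_begin() {}

 private:
  std::unique_ptr<ToyMatcher> matcher_;
  int begin_calls_ = 0;
};

// Equivalent in spirit to placing matcher().buffer(1024); in the hook.
class SmallBufferLexer : public ToyLexer {
 protected:
  void on_begin() override { matcher().buffer(1024); }
};
```

The hook only ever sees the matcher, never the pattern used to construct it, so the pattern can stay a private detail of `lex()`.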

ivanpedruzzi commented 2 years ago

I do not believe this solves the problem I reported. I would like to initialize the matcher elsewhere, and it seems that your change only covers the initialization in function yylex(). Having PATTERN_INITIAL generated as a local variable seems extremely rigid.

genivia-inc commented 2 years ago

To make the regex and pattern objects static members of the lexer class does not make a lot of sense. The regex and pattern values are kept hidden, because their types and values depend on options, such as --fast and --matcher=boost. There is no regex string when --fast and --full are used. The pattern value is a different type, depending on the matcher engine. Making these accessible as members will cause compilation errors when options change. It is also possible that future reflex versions use different types and/or additions to these. That is why I placed them as locals in the lex function. Creating members for these does not solve anything that cannot be solved already by adding a BEGIN. For example, you want to add this line to initialize the matcher:

matcher().buffer(1024);

This only affects the matcher, not the regex or pattern. Those are different (hidden) entities. So you'll just have to declare:

%begin{
matcher().buffer(1024);
}

I don't understand from your comment why you really need the regex and patterns to be static members. Including user-definable BEGIN code solves your problem. It can also serve additional purposes. We already have YY_USER_INIT, which is inserted with the --flex option, but nothing is inserted otherwise. This plugs that hole.

genivia-inc commented 2 years ago

I will also update option --batch to take an argument. So --batch=1024 generates code to initialize the matcher:

matcher().buffer(1024);