Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

Provide a way to inject user code directly into `lex` function. #165

Closed SouravKB closed 1 year ago

SouravKB commented 1 year ago

In Flex, any indented or %{ %} enclosed text that appears before the first rule in the rules section is injected directly into the generated lex function.

In Reflex, however, indented or %{ %} enclosed text in the rules section is placed inside a switch block within the lex function.

This means that, unlike in Flex, I can't use it in Reflex to declare local variables. Such local variables are particularly useful with more(). I ran into exactly this issue while porting code from Flex to Reflex.

The solution is to either use global variables, which makes the lexer non-reentrant, or to store those states inside the lexer object, which means I'm paying for what I'm not using.

So please add a way to inject user code directly into the lex function, as in Flex. As of now, Reflex does not work as a drop-in replacement for Flex if the user code in the rules section contains declarations.
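
The trade-off between the two workarounds can be illustrated with a small sketch that does not depend on RE/flex at all. `GlobalStateScanner`, `MemberStateScanner`, `g_consumed` and `next()` are hypothetical names made up for this illustration; they are not part of the Flex or RE/flex APIs.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical toy scanners, not RE/flex classes. Each call to next()
// consumes one character and returns the number of characters counted so far.

static std::size_t g_consumed = 0;   // global: shared by every scanner instance

class GlobalStateScanner {
 public:
  explicit GlobalStateScanner(std::string input) : input_(std::move(input)) {}
  std::size_t next() {
    if (pos_ < input_.size()) { ++pos_; ++g_consumed; }
    return g_consumed;               // also includes other scanners' characters
  }
 private:
  std::string input_;
  std::size_t pos_ = 0;
};

class MemberStateScanner {
 public:
  explicit MemberStateScanner(std::string input) : input_(std::move(input)) {}
  std::size_t next() {
    if (pos_ < input_.size()) { ++pos_; ++consumed_; }
    return consumed_;                // private to this instance
  }
 private:
  std::string input_;
  std::size_t pos_ = 0;
  std::size_t consumed_ = 0;         // one extra field in every scanner object
};
```

Interleaving calls on two `GlobalStateScanner` objects mixes their counts, whereas two `MemberStateScanner` objects stay independent at the cost of one extra field per object.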

SouravKB commented 1 year ago

Actually the documentation falsely states that CODE is directly interpolated into the lex function which is not true.

genivia-inc commented 1 year ago

It's all in the documentation:

The design replicates the design and output of Flex++. Placing the code in the switch block is necessary to accommodate initial code blocks per start condition state. A single code block can be specified for all inclusive states. Note that the switch only occurs with multiple start condition states.

In C++ you can use private member variables instead of having to hack around with local variables in the lex() function, if these variables affect the scanner state. Local variables can also already be declared inside rule actions, which is the proper place for them.

genivia-inc commented 1 year ago

> Actually the documentation falsely states that CODE is directly interpolated into the lex function which is not true.

Really? How so?

Counter proof:

%%
%{
  SOME CODE
%}
.+  ECHO;
%%

produces:

int yyFlexLexer::yylex(void)
{
  static const reflex::Pattern PATTERN_INITIAL(reflex_code_INITIAL);
  if (!has_matcher())
  {
    matcher(new Matcher(PATTERN_INITIAL, stdinit(), this));
    matcher().interactive();
    YY_USER_INIT
  }
#line 23 "echo.l"

  SOME CODE

  while (true)
  {
        switch (matcher().scan())
        {
          case 0:
            if (matcher().at_end())
            {
              yyterminate();
            }
            else
            {
              output(matcher().input());
            }
            YY_BREAK
          case 1: // rule echo.l:23: .+ :
            YY_USER_ACTION
#line 23 "echo.l"
ECHO;
            YY_BREAK
        }
  }
}

When multiple start condition states are used, each code block is placed in a switch under the case for its start condition.
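
The generated code for the multiple-state case is not shown in this thread. As a generic C++ scoping illustration only, and assuming the user block ends up inside a per-state case within the scanning loop, the sketch below shows why a declaration placed there behaves differently from one placed before the loop. `State`, `dispatch`, `outer` and `inner` are hypothetical names, and this is not RE/flex output.

```cpp
#include <vector>

// Hypothetical stand-in for a scanner loop with two start conditions.
// This is NOT RE/flex output; it only illustrates C++ scoping.
enum State { INITIAL, COMMENT };

std::vector<int> dispatch(const std::vector<State>& states)
{
  std::vector<int> log;
  int outer = 0;                 // declared before the loop: keeps its value
  for (State s : states)
  {
    switch (s)
    {
      case INITIAL:
      {
        int inner = 0;           // declared inside the case: re-created each time
        ++inner;
        ++outer;
        log.push_back(inner);    // record both values for inspection
        log.push_back(outer);
        break;
      }
      case COMMENT:
        ++outer;                 // 'inner' is not visible in this case
        break;
    }
  }
  return log;
}
```

In this sketch `inner` starts from zero every time the `INITIAL` case is entered, while `outer` accumulates across iterations and states.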

genivia-inc commented 1 year ago

Thanks for providing feedback. But issues like these need to have more context and details to be considered.

I have 30+ years of experience using Flex and Bison to design and build compilers, both for various projects and in the graduate Compiler Construction course I taught for many years, and I've worked with many people who used RE/flex to replace Flex. I have little doubt that RE/flex does what it is documented to do.

SouravKB commented 1 year ago

> Counter proof:

My bad, I was using multiple start states, which puts the user code inside the switch.

> The design replicates the design and output of Flex++.

Sorry, I am porting from Flex.

> In C++ you can use private member variables instead of having to hack around with local variables in the lex() function, if these variables affect the scanner state.

In my case, no, they won't affect the scanner state. I use it to store the previous match length when using more(). Does the Reflex API already provide a function for this?

Anyway, not being able to declare local variables in lex means Re-flex cannot be used as a drop-in replacement for Flex.

genivia-inc commented 1 year ago

> In my case, no, they won't affect the scanner state. I use it to store the previous match length when using more(). Does the Reflex API already provide a function for this?

No, you need to use private member variables, because the previous match length should be part of the scanner state. What if your scanner returns? If it returns, you will lose the previous match length. It makes little sense to use local variables this way. I've never seen anyone do that, because it's not safe (and not usable) if you return from the scanner.
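
The concern about returning from the scanner can be sketched without RE/flex. `ChunkScanner`, `lex_with_local()` and `lex_with_member()` are hypothetical names for this illustration, and appending a chunk only loosely imitates what more() does; none of this is the RE/flex API.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy scanner, not the RE/flex API. Each lex call appends the
// next chunk to text_ and returns the length that the scanner remembered
// from before the append.
class ChunkScanner {
 public:
  explicit ChunkScanner(std::vector<std::string> chunks)
      : chunks_(std::move(chunks)) {}

  // Previous length kept in a local variable: reset on every call.
  std::size_t lex_with_local() {
    std::size_t prev_len = 0;        // re-initialized each time lex is entered
    return step(prev_len);
  }

  // Previous length kept in a member: survives returning to the caller.
  std::size_t lex_with_member() { return step(prev_len_); }

 private:
  std::size_t step(std::size_t& prev_len) {
    std::size_t remembered = prev_len;
    if (next_ < chunks_.size()) text_ += chunks_[next_++];
    prev_len = text_.size();         // remember the new length for the next step
    return remembered;
  }

  std::vector<std::string> chunks_;
  std::string text_;
  std::size_t next_ = 0;
  std::size_t prev_len_ = 0;
};
```

With chunks "ab" and "cde", the second call to `lex_with_member()` reports 2, while the second call to `lex_with_local()` reports 0 because the local was recreated on entry.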

> Anyway, not being able to declare local variables in lex means Re-flex cannot be used as a drop-in replacement for Flex.

It's a drop-in replacement of Flex++. There is a small difference.