Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

How to match anything until a delimiter is encountered? #170

Closed SouravKB closed 1 year ago

SouravKB commented 1 year ago

I want to match the syntax ('*)".*?"\1. It should match "foo" and ''"bar"'', but should not match ''"baz"'. Is there a correct way to match this using the reflex matcher? The nearest I could achieve was the following lexer:

%x STRING

%%

'*\" {
    textLen = 0uz;
    quoteLen = size();
    start(STRING);
}

<STRING> {

\"'* {
    if (size() < quoteLen) goto MORE_TEXT;
    matcher().less(quoteLen);
    start(INITIAL);
    res = std::string{begin(), textLen};
    return TokenKind::STR;
}

[^"]* {
    MORE_TEXT:
    textLen = size();
    matcher().more();
}

<<EOF>> {
    std::cerr << "Lexical error: Unterminated 'STR' \n";
    return TokenKind::ERR;
}

}

%%

The problem with this is that it matches only valid UTF-8 inside strings, but I want anything inside a string to be matched. I considered three workarounds, but all three seem to have issues.

  1. Use skip(). This skips all characters until it reaches the delimiter, but in the process it consumes all the string content, so I don't get to keep it.
  2. Use .*?\" instead of [^"]*. This works for every properly terminated string, but the lexer gets jammed if the string is not terminated.
  3. Consume the string content character by character using . (a single dot). Since . is synchronizing, it can even match invalid UTF-8 sequences. But this approach feels far too slow.

So is there any better approach for solving this?

genivia-inc commented 1 year ago

It is fine to assume a Perl regex matcher is used (because of the \1 backreference). Just use option -m or %matcher to specify a matcher; see the documentation.
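To illustrate what the \1 backreference buys, here is a minimal sketch that tries the pattern from the question against the three sample inputs. It uses std::regex from the C++ standard library as a stand-in for a backtracking matcher, not any RE/flex API, and the helper name is_quoted_string is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of RE/flex: returns true when the whole
// input is one quoted string of the form ('*)".*?"\1.
// A custom raw-string delimiter is needed because the pattern contains )".
bool is_quoted_string(const std::string& s) {
    static const std::regex re(R"re(('*)".*?"\1)re");
    // regex_match requires the pattern to cover the entire input.
    return std::regex_match(s, re);
}
```

Under these assumptions, is_quoted_string accepts "foo" and ''"bar"'' and rejects ''"baz"', because the closing run of single quotes must repeat exactly what group 1 captured.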

However, there are some things to be aware of:

This is actually not really a RE/flex question or issue, but a general tokenization or regex question for which there are other venues to obtain advice.

SouravKB commented 1 year ago

> what you want can be best done with the usual POSIX matcher of Flex and RE/flex by counting the number of opening quotes and matching the same number at the closing.

That's what I was trying to do here (I forgot to mention that I was using the reflex matcher). In my first comment above, I provided an example lexer specification that almost works, but it is not fully correct, so I needed help.

It is also true that I should have asked this in a Q&A forum. But the question isn't a general regex question; it is specific to the abilities of the reflex matcher. Hence I decided to ask here.
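For reference, the quote-counting idea discussed above can be sketched outside of any lexer generator as a plain byte scan. This is only an illustration under stated assumptions: the names QuotedString and scan_quoted are hypothetical, the input is treated as raw bytes with no UTF-8 validation, and none of this is RE/flex API.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type: the string body and the total bytes consumed.
struct QuotedString {
    std::string_view text;  // bytes between the delimiters
    std::size_t length;     // bytes consumed, delimiters included
};

// Hypothetical helper: scans one string of the form '...'"body"'...' at the
// start of `in`, where the number of closing single quotes must equal the
// number of opening ones. Returns nullopt if no such string starts here or
// if it is unterminated.
std::optional<QuotedString> scan_quoted(std::string_view in) {
    // Count the opening single quotes.
    std::size_t n = 0;
    while (n < in.size() && in[n] == '\'') ++n;
    if (n >= in.size() || in[n] != '"') return std::nullopt;

    const std::size_t body = n + 1;  // first byte after the opening "
    // Try every " as a closing candidate; accept the first one that is
    // followed by at least n single quotes.
    for (std::size_t q = in.find('"', body); q != std::string_view::npos;
         q = in.find('"', q + 1)) {
        std::size_t k = 0;
        while (k < n && q + 1 + k < in.size() && in[q + 1 + k] == '\'') ++k;
        if (k == n) return QuotedString{in.substr(body, q - body), q + 1 + n};
    }
    return std::nullopt;  // unterminated string
}
```

Because the scan compares individual bytes, the body may contain arbitrary data, including invalid UTF-8 sequences, and an unterminated string is reported with nullopt instead of stalling. Single quotes beyond the required count after the closing " are left unconsumed, which mirrors the less(quoteLen) call in the lexer from the first comment.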

SouravKB commented 1 year ago

I have asked the question here on Stack Overflow. I have settled for a workaround, which I have shown in the answers there.