Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

Support for negative lookaheads and lookbehinds in RE-flex. #168

Closed by SouravKB 1 year ago

SouravKB commented 1 year ago

The user guide says:

To prevent a Perl matcher from matching a keyword when an identifier starts with the name of that keyword, we could use a lookahead pattern such as int(?=[^A-Za-z0-9_]) ...

However, int(?=[^A-Za-z0-9_]) won't match int when it is the last token in the file, because the positive lookahead requires one non-word character after it. The correct pattern for matching int would be int(?![A-Za-z0-9_]), which conveys the actual intention: no word character may follow int.
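To make the difference concrete, here is a minimal sketch that uses `std::regex` from the C++ standard library (ECMAScript grammar) as a stand-in. It is not the RE/flex API, and `matches_at_start`, `kPositive` and `kNegative` are hypothetical names chosen for illustration.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` matches at the very start of `input`.
// Assumption: std::regex stands in for a Perl-style matcher here.
bool matches_at_start(const std::string& pattern, const std::string& input) {
    const std::regex re(pattern);
    return std::regex_search(input, re, std::regex_constants::match_continuous);
}

// Positive lookahead: requires one non-word character after "int".
const std::string kPositive = "int(?=[^A-Za-z0-9_])";

// Negative lookahead: requires that no word character follows "int",
// which also holds when "int" is the last thing in the input.
const std::string kNegative = "int(?![A-Za-z0-9_])";
```

With these definitions, `matches_at_start(kPositive, "int")` should be false because nothing follows the keyword, while `matches_at_start(kNegative, "int")` should be true. Both patterns reject `"integer"`.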

AFAIK, there is no support for negative lookaheads and lookbehinds currently. Can you please add them?

Or do you suggest a better way to handle this in current RE-flex?

SouravKB commented 1 year ago

In my case, I am using the reflex matcher (not the Perl matcher). I want 12.34 to tokenize as NUM(12.34), whereas 12.34ab should tokenize as NUM(12), OPER(.), ID(34ab). If negative lookaheads were supported, I could write the NUM pattern as [0-9_]+([.][0-9_]+)?(?![a-zA-Z0-9_]). Unfortunately, negative lookahead does not work at the moment. What should I do?
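The intended behaviour of that NUM pattern can be checked with a backtracking engine that does support negative lookahead. The sketch below uses `std::regex` (ECMAScript grammar) as an assumption-laden stand-in, not the RE/flex POSIX matcher, and `match_num` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Returns the prefix of `input` matched by the NUM pattern from the comment
// above, or an empty string if the pattern does not match at position 0.
// Assumption: std::regex backtracking semantics, not RE/flex's DFA matcher.
std::string match_num(const std::string& input) {
    static const std::regex num("[0-9_]+([.][0-9_]+)?(?![a-zA-Z0-9_])");
    std::smatch m;
    if (std::regex_search(input, m, num,
                          std::regex_constants::match_continuous)) {
        return m.str(0);
    }
    return "";
}
```

For `"12.34ab"` the engine first tries `12.34`, rejects it because `a` follows, then backs off to `12`, which is followed by `.` and is therefore accepted. For `"34ab"` no prefix satisfies the lookahead, so a separate ID rule would have to take over.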

SouravKB commented 1 year ago

Alternatively, could you let us customize what constitutes a word boundary? For example, by adding an option such as %option word-char="\p{L}\p{M}\p{N}\p{Pc}". If I could change which characters count as word characters, I could easily solve the problem above by adding \> at the end of the NUM pattern. Currently, word characters are hard-coded to [\p{L}\p{Nd}\p{Pc}], which does not meet my requirements.

genivia-inc commented 1 year ago

> In the user guide, it actually says that,
>
> To prevent a Perl matcher from matching a keyword when an identifier starts with the name of that keyword, we could use a lookahead pattern such as int(?=[^A-Za-z0-9_]) ...
>
> But actually, int(?=[^A-Za-z0-9_]) won't match int if it is the last token in the file, since it requires one non-word-character after it. The correct regex pattern for matching int would be int(?![A-Za-z0-9_]), which correctly conveys the intention that there should not be a word-character after it.

It says "we could use" and "such as", which leaves it entirely to the developer to choose an appropriate pattern for their purposes. The example also suggests that int isn't the last token in a file (it's the type int, after all). Nothing wrong with that.

> AFAIK, there is no support for negative lookaheads and lookbehinds currently. Can you please add them?

The "Perl matcher", which is based on PCRE, supports lookaheads and lookbehinds in RE/flex, as the RE/flex documentation says. So I fail to see your point.

However, Perl matching isn't identical to POSIX matching (see the documentation; in particular, Perl matching does not follow the leftmost-longest rule), so your rules will behave differently. Lookbehind cannot be added to a POSIX matcher without running into the theoretical limitations of efficient DFA-based POSIX matching. Lookaheads are theoretically and practically possible in a POSIX matcher, but they are rarely if ever used in programming language tokenizers, where a "trailing context" (a lookahead at the end of a pattern) is available instead.
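The difference between the two rule-selection strategies can be emulated in a few lines. This is only a sketch under stated assumptions: it uses `std::regex` to test each rule separately, it does not reproduce how RE/flex compiles rules internally, and `anchored_length`, `first_match_rule` and `longest_match_rule` are hypothetical names.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Length of the match of `pattern` anchored at the start of `input`,
// or -1 if the pattern does not match there.
long anchored_length(const std::string& pattern, const std::string& input) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(input, m, re,
                          std::regex_constants::match_continuous)) {
        return static_cast<long>(m.length(0));
    }
    return -1;
}

// Perl-style choice: the first rule (in order) that matches wins.
// Returns the index of the winning rule, or -1 if no rule matches.
int first_match_rule(const std::vector<std::string>& rules,
                     const std::string& input) {
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (anchored_length(rules[i], input) >= 0) return static_cast<int>(i);
    }
    return -1;
}

// POSIX/Flex-style choice: the longest match wins; on a tie the
// earlier rule wins. Returns the rule index, or -1 if none matches.
int longest_match_rule(const std::vector<std::string>& rules,
                       const std::string& input) {
    int best = -1;
    long best_len = -1;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        const long len = anchored_length(rules[i], input);
        if (len > best_len) {
            best_len = len;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Given the rules `int` and `[A-Za-z_][A-Za-z0-9_]*` in that order and the input `integer`, the first-match strategy selects the keyword rule, whereas the longest-match strategy selects the identifier rule. That is why the keyword rule needs a lookahead under Perl-style matching but not under POSIX matching.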

A long time ago I added the "Perl matcher" to handle these kinds of problems with the traditional POSIX matching of Flex/Lex lexers. A POSIX matcher doesn't have to be used at all costs.