Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

Question/feature request: discard part of a match without returning it to the input? #190

Closed tlemo closed 11 months ago

tlemo commented 1 year ago

Does RE/flex provide any way to discard part of the matched input without returning the discarded part to the input? Basically, something along the lines of matcher.less(n), except that the trimmed part is discarded rather than returned to the input.

Consider a hypothetical example, scanning a basic string literal:

```
\"           { push_state(STRING); }
<STRING>[^"] { matcher().more(); }
<STRING>\"   { pop_state(); return kStringToken; }
```

This would work fine, although it comes with a small nuisance: the final \" is part of the lexeme (for "foo", the lexeme would be foo"). While this is minor, it would be nice to avoid the special-casing needed to drop the closing '\"' outside the lexer.

PS. This is similar to the XML example in the RE/flex user guide, where less(size-1) is used.

genivia-inc commented 11 months ago

To discard part of the match without returning it to the input, use the str() method (or wstr() for wide strings) to place the match in a std::string that can be truncated.

tlemo commented 11 months ago

Thanks. Yes, truncating the lexeme later on is one option, but it's not pretty. Not all rules require immediate truncation, which means that at the scanner level, where we deal with generic tokens and lexemes, we'd keep the "un-truncated" lexeme around.

This might pop up in diagnostics or other places where the generic tokens are processed. It's not a major issue, but it would be nicer and simpler to produce clean lexemes in the first place rather than requiring a later fix-up.

genivia-inc commented 11 months ago

Well, as you know, less(n) truncates the match but restarts scanning at the truncated position. The reason is the same as in Flex: to ensure high-performance scanning by simply inserting a \0 at the position of the end of the pattern match. This lets yytext return a pointer into the buffer with the terminating \0. When the scanner continues, the \0 is replaced with the character that occupied that position before. It is simple, efficient, and fast.

If you want a lexeme that is shorter after some truncation, then just copy the part of it that you need. The most efficient way is to use std::string(matcher->begin(), matcher->size()-1) to place the lexeme in a string without its last character, for example. Or copy it to a buffer, send it to a stream, etc.

tlemo commented 11 months ago

> Well, as you know less(n) truncates but restarts at the truncated position. The reason is the same as Flex to ensure high-performant scanning by just inserting a \0 at the position of the end of the pattern match. This lets yytext return a pointer to the buffer with the terminating \0. The \0 is then replaced with the character that occupied it before when the scanner continues. It is simple, efficient, and fast.

I think that discarding characters would still fit within that model: the \0 would just be in front of the restart position. Discarding seems to me like half of less(): it would move back the end of the match without also changing the restart position. Am I missing anything?

> If you want to extract a lexeme that is shorter after some truncation, then just copy the part of it that you need. The most efficient way is to use std::string(matcher->begin(), matcher->size()-1) to place the lexeme in a string without the last character, for example. Or copy it to a buffer, or send it to a stream, etc.

Thanks, but that still leaves me having to either copy all lexemes or carry un-truncated lexemes around at the scanner level, which is precisely what I'd like to avoid.

genivia-inc commented 11 months ago

> I think that discarding characters would still fit within that model - the \0 would just be in front on the restart position. Discarding seems to me like half of less(): it would move back the end of the match w/o also changing the restart position. Am I missing anything?

Some background: it is possible to store a \0 anywhere in the yytext() buffer, as the old Lex suggested, but Flex changed yytext() to const char* to prevent that. There are good reasons for that in RE/flex as well: an input buffer may be read-only, for example when scanning read-only mmap'ed files that can be tokenized with buffer(memory) to scan memory in place. This means that yytext() should never be modified; instead, begin() should be used, which is a pointer to the lexeme that is not \0-terminated. This is documented in the RE/flex manual and allows for efficient zero-copy scanning of mmap'ed files.

Adding a lexeme truncation method only complicates things (it is not allowed in that case) with very little benefit, since you can just extract the lexeme in any way you want.