Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

A single multibyte char matches dotdot #193

Closed. haihuayang closed this issue 11 months ago.

haihuayang commented 11 months ago

Hi, I found that in Unicode mode the generated lexer matches a single multibyte character with '..'. Could you help me check whether this is bad usage or a bug? The following program reproduces it; I build and test it with these commands:

$ reflex --full --batch --unicode --outfile=unicode.cxx unicode.l
$ g++ -g3 -o unicode unicode.cxx -lreflex
$ ./unicode 好
lex(好)=1
$ ./unicode 好的
lex(好的)=1
$ ./unicode aa
lex(aa)=1
$ ./unicode a
lex(a)=0

It looks like '..' matches both one multibyte character and two multibyte characters.

%option reentrant
%option noyywrap
%option 8bit

%{
#include <assert.h>
#include <sstream>
#include <stdio.h>
%}

%%
..x     return 1;
.       return 0;
%%

int main(int argc, char **argv)
{
        std::string v(argv[1]);
        v += "x";
        std::istringstream is(v);

        Lexer lexer;
        int ret = lexer.lex(is);
        printf("lex(%s)=%d\n", argv[1], ret);
        return 0;
}
haihuayang commented 11 months ago

By the way, I am using RE/flex 3.3.6.

genivia-inc commented 11 months ago

A dot matches a byte, and with --unicode it also matches a Unicode character. Because two dots can also match two bytes, you get these results with multi-byte characters. To match a single Unicode character, always use \X instead of a dot, or use \p{Unicode} or a narrower character class.
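For example, here is a minimal sketch of the rules section from the report above rewritten with \X, as an illustration of this advice (assuming a version of RE/flex where \X matches exactly one valid Unicode character, which later comments in this thread confirm for 3.5):

%%
\X\Xx   return 1;
\X      return 0;
%%

With these rules, ./unicode 好 should report lex(好)=0 and ./unicode 好的 should report lex(好的)=1, because each \X consumes exactly one whole UTF-8 encoded character.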

A dot is "permissive" on purpose in RE/flex, so that you can match anything, including invalid UTF-8. Otherwise, you can't even define a simple error rule that generates a lexical error when the lexer encounters an invalid byte in the input. In the worst case, the Flex-compatible lexer echoes these invalid bytes via the so-called "default rule", which is enabled by default.

See the RE/flex documentation, which has an explanation of this subject: Invalid UTF encodings.

I don't know if it is possible to force .. to only match one Unicode character with this approach. After all, each dot permits a single (invalid) Unicode (multi-)byte. It might only be possible if we correlate two or more dots into some other pattern. But that is tricky to do, e.g. .(.|a) also has two dots.

genivia-inc commented 11 months ago

Adding to my comments above to explain why this choice was made for RE/flex, note that setting option --nodefault (%nodefault) turns the Flex-based "default rule" off, but this terminates the scanner with an error "scanner jammed" when a rule does not match. That would happen even when defining a "catch-all-else" rule to produce error messages, like so:

.    { std::cerr << "lexical error, full stop!" << std::endl;
       return 0; }

which would NOT match an invalid or incomplete UTF-8 byte, leading to "scanner jammed", if we did not let . be more permissive and match such bytes as well. This choice errs on the side of caution and safety, since it is compatible with Flex scanners that assume . matches any non-newline character, and we don't want an application to terminate with a "scanner jammed" error.

An alternative to a "catch-all-else" rule is to use a RE/flex exception handler, which is nicer: see the options and the bison-complete example with an exception handler.
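As a rough sketch of what the calling side could look like with exception handling, the main function of the repro program above could be wrapped as follows. This assumes the scanner is configured, via the option described in the linked documentation, to throw when no rule matches; the exception type caught here is an assumption, not necessarily the documented type:

int main(int argc, char **argv)
{
        if (argc < 2)
                return 1;
        std::string v(argv[1]);
        v += "x";
        std::istringstream is(v);

        Lexer lexer;
        try {
                // assumed: the generated scanner throws on a lexical error
                int ret = lexer.lex(is);
                printf("lex(%s)=%d\n", argv[1], ret);
        } catch (const std::exception &e) {
                // the caught type is an assumption; see the RE/flex options documentation
                fprintf(stderr, "lexical error: %s\n", e.what());
                return 1;
        }
        return 0;
}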

haihuayang commented 11 months ago

Thank you for sharing the detailed information. I will try it.

genivia-inc commented 11 months ago

This is an interesting problem that needs a resolution or at least a document update.

The \X pattern is restricted to valid Unicode only and is much more complex than the . pattern. But it looks like it might actually have the same problem as ., since it also permits bytes to match. So this needs an update to fix that. \p{Unicode} works alright.

haihuayang commented 11 months ago

Maybe you can make \X match a single valid Unicode character, or a single byte when it is part of an invalid Unicode sequence.

genivia-inc commented 11 months ago

I agree that \X should be changed to only match Unicode. It is an oversight on my part not to have implemented it that way in the first place.

The . is still special for the reasons stated above.

It is not trivial, or perhaps even possible, to design a DFA representation of . for a UTF-8 matcher that matches either a valid Unicode character or a single byte of an invalid Unicode multi-byte sequence. The problem is that we would still end up with .. matching a single valid Unicode character in some cases. For example, trying to match an ASCII char with a byte-based regular expression [\x00-\x7f], or a non-starting UTF-8 byte (invalid Unicode) [\x80-\xbf], or a valid Unicode character [\xc2-\xd7][\x80-\xbf]|[\xe0-\xef][\x80-\xbf] (simplified, up to 3 multi-bytes), gives the pattern [\x00-\x7f]|[\x80-\xbf]|[\xc2-\xd7][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]. But two of these patterns used as .. will still match a single Unicode character in some cases, like in your case.
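To make this concrete, here is a sketch of one way the simplified byte classes above can consume the input from the original report (the real UTF-8 pattern behind . is more involved, so treat this purely as an illustration):

"好x" in UTF-8 is the byte sequence E5 A5 BD 78:

  first .   ->  [\xe0-\xef][\x80-\xbf]   matches E5 A5
  second .  ->  [\x80-\xbf]              matches BD
  x         ->  literal x                matches 78

This shows how ..x can consume the single character 好 followed by x, which matches the behavior the repro program shows (lex(好)=1).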

Just make sure to use . only as a catch-all in a scanner/tokenizer, and be aware of the consequences when using Unicode. I will update the documentation to make that clearer. Otherwise, use \X (which I will fix) or \p{Unicode}, which is more restrictive.

genivia-inc commented 11 months ago

RE/flex 3.5 is released. This fixes the \X matching problem and clarifies in the documentation why . can match a single byte of a UTF-8 encoded Unicode character.

genivia-inc commented 11 months ago

The v3.5 update changes . to match Unicode as if \X were used, but only when the RE/flex regex library is used. RE/flex scanners will still treat . as permissive and also match single bytes.

Maybe you can make \X match a single valid Unicode character, or a single byte when it is part of an invalid Unicode sequence.

That is technically impossible with regular expressions. It can only be done with a pre-processor of the input, but that hurts performance significantly, and we don't want to reduce performance. Secondly, the . is a catch-all-else to find lexical errors in the input.

It is a choice: either . matches valid Unicode only, or it matches any byte, including parts of UTF-8 multi-byte sequences.

It is possible to make this a choice by adding a new option to RE/flex, e.g. %strict, to change . to match like \X and no longer act as a catch-all-else. But no RE/flex user has presented a use case with a request for this before, so I have no strong motivation to add it unless there are use cases that show it is really needed.

Just use \X to match Unicode characters exactly in lexers. You can still use . to match characters, but it is important to use . carefully and not as a .. pair: in Flex, .. would also match the multi-byte sequence of a single Unicode character.
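Pulling the advice in this thread together, here is a hedged sketch of a rules section that uses \X for Unicode tokens while keeping the permissive . as a final catch-all error rule (the error message and the -1 return value are illustrative choices, not taken from the RE/flex examples):

%%
\X\Xx   return 1;
\X      return 0;
.       { std::cerr << "lexical error: invalid byte in input" << std::endl;
          return -1; }
%%

Because \X only matches valid Unicode, the permissive . at the end only fires on input that \X does not match, such as invalid or incomplete UTF-8 bytes, so it serves purely as the error rule.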