Complex rules with Unicode result in compilation error

Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.

BSD 3-Clause "New" or "Revised" License

When a lexer uses complex rules containing Unicode characters and is generated without the --fast or --full reflex options, GCC issues an error message like this when compiling the generated code:

foo.ll: In member function ‘virtual int Lexer::lex()’:
foo.ll:196: error: narrowing conversion of ‘4294967266’ from ‘long int’ to ‘char’ [-Wnarrowing]

This is a minimal lexer file that produces such an error on Windows:

%option unicode

EnDash \u{2013}

VariableChars [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nd}_?]+
VariableToken "$$"|"$?"|"$^"|"$"{VariableChars}|"$"\{([^`}]|`.)+\}

%%

{EnDash}
{VariableToken}

%%

IIRC the regex size is barely above 16kB; on Linux, a 64kB regex would be required to trigger the error. The \u{2013} causes the issue, while VariableToken expands the regex above the mentioned size.

The cause is a conversion of signed characters to unsigned ints in write_regex(). I've fixed the issue in #106 by casting the character to the integer type whose signedness matches that of the char type. In addition, the character passed to isprint was promoted to an int, which triggered an assertion failure while debugging the code with Visual Studio, so I fixed that too.

The fix for the generated code assumes, however, that both reflex and the lexer are compiled with the same signedness of the plain char type. The build instructions should then be extended with advice to use the matching setting (whatever is used in the shipped binary), but I don't know whether this causes any side effects. A truly portable fix would be either to explicitly specify the signedness of the characters of the generated REGEX_...-array, or to not use raw numbers at all and instead encode them as hex numbers in char literals.

I've implemented the latter approach anyway, since I didn't realize the generated character was in fact proper UTF-8 and mistook it for ASCII in the text editor. It is available in the bugfix/no_raw_utf8 branch (https://github.com/GDATASoftwareAG/RE-flex/tree/bugfix/no_raw_utf8). Hopefully the fix is correct; I just copied one of the many different conversions from this function. At least with my lexer file it worked as desired.

As I don't know what your future plans for RE/flex are, I'll start with this PR. Just let me know if you also want a PR for the other branch.
