Closed by thorsten-fluegel 3 years ago
The reason is a conversion of signed characters to unsigned ints in write_regex().
I've fixed the issue in #106 by casting the character to the int type whose signedness matches that of the char type.
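To illustrate the underlying conversion problem (this is only a sketch, not the actual change in #106, and the function names are made up): a UTF-8 byte such as 0xE2 held in a signed char has the value -30, and converting it straight to unsigned int sign-extends it instead of yielding 226.

```cpp
// Hypothetical helpers, not taken from the reflex sources.

// Converts directly from signed char to unsigned int. A negative input
// wraps around to a huge value instead of staying in the range 0..255.
constexpr unsigned int widen_directly(signed char c) {
    return static_cast<unsigned int>(c);
}

// Converts to unsigned char first, so the result is always the byte
// value in the range 0..255.
constexpr unsigned int widen_via_unsigned_char(signed char c) {
    return static_cast<unsigned char>(c);
}
```

With the byte 0xE2 stored as -30, the second helper yields 226, while the first yields a value far above 255.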
Also, the character passed to isprint was promoted to an int. This triggered an assertion failure while debugging the code with Visual Studio, so I fixed this, too.
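For reference, the usual way to call isprint safely looks roughly like this. The function name is hypothetical and this is not necessarily the exact change in the PR; the point is that isprint only accepts values representable as unsigned char (or EOF), so a negative plain char must be converted first.

```cpp
#include <cctype>

// Hypothetical wrapper, not part of reflex.
// Converting to unsigned char first keeps the argument in the range
// 0..255, which is what std::isprint requires. Passing a negative char
// directly is undefined behavior and trips the debug assertion in
// Visual Studio's runtime.
bool is_printable_byte(char c) {
    return std::isprint(static_cast<unsigned char>(c)) != 0;
}
```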
The fix for the generated code assumes, however, that both reflex and the lexer are compiled with the same signedness of the plain char type. The build instructions should then be extended with a note advising users to pick the matching setting (whatever is used in the shipped binary), but I don't know whether this causes any side effects.
A truly portable way of fixing this would be to either explicitly specify the signedness of the characters of the generated REGEX_... array, or to not use raw numbers at all and instead encode them as hex numbers in char literals.
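A minimal sketch of the second option, assuming the generator emits one array element per byte. The helper name is made up and this is not the code from the branch. A literal such as '\xe2' already has type char, so it initializes a char array element without a narrowing conversion, whichever signedness plain char has.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not part of reflex: formats one byte as the source
// text of a character literal with a two-digit hex escape, e.g. '\xe2'.
std::string hex_char_literal(char c) {
    char buf[8];  // 6 characters for '\xHH' plus the terminating NUL
    std::snprintf(buf, sizeof buf, "'\\x%02x'",
                  static_cast<unsigned int>(static_cast<unsigned char>(c)));
    return buf;
}
```

Emitting each byte as its own char literal also avoids the pitfall of hex escapes inside string literals, where a following hex digit would be absorbed into the escape.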
I've implemented the latter approach anyway, since I didn't realize the generated character was indeed proper UTF-8 and mistook it for ASCII in the text editor. It is available here: https://github.com/GDATASoftwareAG/RE-flex/tree/bugfix/no_raw_utf8 Hopefully the fix is correct; I just copied one of the many different conversions from this function. At least with my lexer file it worked as desired.
As I don't know what your future plans are for RE/flex I'll start with this PR. Just let me know if you also want a PR for the other branch.
I encourage you to read my comment with your PR and update your PR to fix this. Thanks for helping out!
When a lexer uses complex rules containing Unicode characters and is generated without the --fast or --full reflex options, GCC issues an error message like this when trying to compile the generated code:
This is a minimal lexer file that produces such an error on Windows:
IIRC the regex size is barely above 16 kB; Linux would require a 64 kB regex. The \u{2013} causes the issue, while VariableToken expands the regex above the mentioned size.
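As a side note on why \u{2013} in particular is affected: its UTF-8 encoding is the three bytes E2 80 93, all of which are above 127 and therefore negative when plain char is signed. The sketch below is a generic UTF-8 encoder written for illustration, not code from reflex, and it does not reject surrogates or values above U+10FFFF.

```cpp
#include <string>

// Illustrative encoder (hypothetical name): converts one Unicode code
// point to its UTF-8 byte sequence. No validation of the input is done.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```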