Closed: mdehling closed this issue 8 months ago
Trying to build the braille example using Clang 16.0.6 or GCC 10.4.0 leads to the following errors:
Am I doing something wrong? I am not very knowledgeable when it comes to character encodings.
Build system is NetBSD 10 (beta).
Thank you for your feedback.
This indeed looks very wrong. I've never seen this before on any platform and compiler I've tested. Something is breaking when the regular expressions are parsed and converted to code. The subsumption warnings are wrong and indicate that something is broken in the regex parsing and conversion.
RE-flex does not depend on the compiler, the OS, or 32-bit versus 64-bit architectures, all of which were tested. I cannot explain why this happens without tracing the execution (see further below).
"braille.l:63:44: warning:" this line has UTF-8 character encodings like all the other:
⠉ emit("c"); start(letters);
and the correct generated code is the following with valid UTF-8 in the literal string:
static const char *REGEX_INITIAL = "(?m)(⠰)|(⠁)|(⠃)...
whereas in the error you get, it shows as:
static const char *REGEX_INITIAL = "(?m)(<A0>)|(<A0>)|((?:⠀))
I don't have Clang 16 (I have Clang 14). I wish there were a Docker container available, or one that could be created, to replicate the problem, because I don't want to mess with my development environments.
To trace all of reflex for debugging, change lib/Make to enable this line:
CMFLAGS=-DDEBUG
then execute make -f Make clean followed by make -f Make in the lib directory. Do the same in the src directory by changing src/Make, to rebuild the reflex tool.
Running the reflex command then generates a large DEBUG.log file. This is useful to trace the execution and to spot differences right away, e.g. by comparing the logs with diff.
Thanks for the quick response and happy pi day :)
Adding -DDEBUG to CXXFLAGS in lib/ leads to the following error:
pattern.cpp: In member function ‘void reflex::Pattern::gen_match_hfa_start(reflex::Pattern::DFA::State*, reflex::Pattern::HFA::State&, reflex::Pattern::HFA::StateHashes&)’:
pattern.cpp:4417:40: error: ‘state’ was not declared in this scope
4417 | DBGLOG("0 HFA %p: %u..%u -> %p", state, lo, hi, next_state);
| ^~~~~
The log statement in this function was introduced in v4.0.0, and at the same time the state parameter was renamed to start. After replacing state with start in the DBGLOG() statement, the build completes.
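For reference, the corrected line then presumably reads as follows (a sketch based on the rename described above):

DBGLOG("0 HFA %p: %u..%u -> %p", start, lo, hi, next_state);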
I've attached the DEBUG.log files for a run of reflex braille.l on NetBSD and Ubuntu. The diff -u output for the logs, with timestamps trimmed, looks like this:
--- DEBUG-Ubuntu-trimmed.log 2024-03-14 16:21:37.863987366 -0700
+++ DEBUG-NetBSD-trimmed.log 2024-03-14 16:21:22.395122036 -0700
@@ -9,12 +9,18 @@
pattern.cpp:1149 BEGIN parse4(5)
pattern.cpp:1402 END parse4()
pattern.cpp:1133 END parse3()
-pattern.cpp:988 BEGIN parse3(6)
-pattern.cpp:1149 BEGIN parse4(6)
+pattern.cpp:972 END parse2()
+pattern.cpp:851 END parse1()
pattern.cpp:1402 END parse4()
pattern.cpp:1133 END parse3()
-pattern.cpp:988 BEGIN parse3(7)
-pattern.cpp:1149 BEGIN parse4(7)
+pattern.cpp:972 END parse2()
+pattern.cpp:867 BEGIN parse2(8)
+pattern.cpp:988 BEGIN parse3(8)
+pattern.cpp:1149 BEGIN parse4(8)
+pattern.cpp:810 BEGIN parse1(9)
+pattern.cpp:867 BEGIN parse2(9)
+pattern.cpp:988 BEGIN parse3(9)
+pattern.cpp:1149 BEGIN parse4(9)
pattern.cpp:1402 END parse4()
pattern.cpp:1133 END parse3()
pattern.cpp:972 END parse2()
[...]
Does this tell you anything?
The diff shows that the regex pattern parsed from the braille.l file is somehow different, or that the regex converter in lib/convert.cpp behaves differently when converting the regex pattern. I can see this from the parseN() calls, which show the difference at a | (alternation); this aligns with the pattern errors:
"(?m)(<A0>)|(<A0>)|((?:⠀))
versus
"(?m)(⠰)|(⠁)|(⠃)
which has UTF-8 multibyte sequences, not a single <A0> character, which by itself is not valid UTF-8.
The problem is either in the stage where the patterns are taken from the braille.l file, or in the next stage, the converter that analyzes the regex pattern strings and converts them when needed, e.g. to convert a Unicode regex into a regex that works with any 8-bit regex engine. So it may not be the pattern-to-DFA construction in pattern.cpp, which is all that this DEBUG.log code traces.
A <A0> in the regex pattern makes no sense here. The Braille code block starts at U+2800, encoded as the UTF-8 multibyte sequence E2 A0 80. But is there really a single <A0> when the error occurs? Or is it part of a UTF-8 multibyte sequence? The generated lexer code has a very long regex pattern string that contains these bytes, and I wonder if that is where the problem shows up.
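To see why a lone A0 byte is suspicious, here is a minimal standalone C++ sketch (illustrative, not RE-flex code) that encodes U+2800 to UTF-8; in the Braille block, A0 appears only as a continuation byte after the lead byte E2:

#include <cstdio>

int main() {
  unsigned int c = 0x2800; // first code point of the Braille block
  // standard three-byte UTF-8 encoding for U+0800..U+FFFF
  unsigned char b1 = 0xE0 | (c >> 12);         // 1110xxxx -> E2
  unsigned char b2 = 0x80 | ((c >> 6) & 0x3F); // 10xxxxxx -> A0
  unsigned char b3 = 0x80 | (c & 0x3F);        // 10xxxxxx -> 80
  std::printf("%02X %02X %02X\n", b1, b2, b3); // prints E2 A0 80
  return 0;
}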
What if you truncate the braille.l definitions to just a few patterns? Like so:
%%
<*>⠰ start(INITIAL);
<INITIAL,letters>{
⠁ emit("a"); start(letters);
}
<INITIAL>{
// CONTRACTIONS
⠃/{la} emit("but");
}
<letters>{
// DIGRAPHS
⠂ emit("ea");
}
<*>⠼ start(numbers);
<numbers>{
⠁ emit("0");
}
<*>{
// PUNCTUATION
⠂/{la} emit(",");
// SPACING
\u{2800} emit(" "); start(INITIAL);
[ \t\n\r] emit(text()); start(INITIAL);
// OTHER BRAILLE LIGATURES
\p{Braille} emit(text());
}
<*>. std::cerr << "Error in input at " << lineno() << std::endl;
%%
That should make it easier to see what the generated string looks like with the <A0>.
This is a very strange issue. I've avoided undefined behavior and things like char being unsigned or signed, which differ between compilers and OSes.
Stepping through Reflex::get_pattern() in gdb led to the underlying issue:
On my NetBSD system, std::isspace() returns true for '\xE2' and '\xB0' and false for '\xA0'. On my Ubuntu test machine it returns false for all three characters. As a result, the pattern recognized on my NetBSD system is just the middle '\xA0' of the three-byte UTF-8 sequence.
I'm not sure if this is a NetBSD libc bug or some configuration issue (the standard allows for locale-specific whitespace characters). Either way, I'm sorry for the noise. It seems this is not a RE-flex issue.
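As background on why the same byte can behave differently across platforms, here is a small illustrative C++ sketch (not RE-flex code) of what a high-bit byte becomes when a plain char is passed as an int on a platform where char is signed:

#include <cstdio>

int main() {
  char c = static_cast<char>(0xE2); // first byte of the UTF-8 encoding of U+2800
  int as_int = c;                   // sign-extends to -30 where char is signed;
                                    // stays 226 where char is unsigned
  std::printf("%d vs %d\n", as_int, (unsigned char)c); // e.g. -30 vs 226
  return 0;
}

On some C library implementations, passing such a negative value to std::isspace() indexes a character class table out of range, so the result is arbitrary.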
Or this may be caused by passing a char to isspace() without casting to unsigned char first.
The Reflex::get_pattern() uses several isspace() calls, which are all applied to the char value of line.at(pos), where Reflex::line is a std::string.
The std::string::at() method returns a char reference. Whatever the sign of a char is on this platform should match the signedness that std::isspace() expects as an int. We should not have to cast. Something is fundamentally wrong with this 🤔
I think you are mistaken: see, e.g., https://en.cppreference.com/w/cpp/string/byte/isspace .
std::isspace expects the int argument to be in the range of an unsigned char, or equal to EOF. Anything else is undefined behavior.
(And while preparing a patch for this, I noticed that you do indeed handle this correctly in other places!)
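A minimal sketch of the hazard and the conventional fix (hypothetical helper code, not taken from RE-flex):

#include <cctype>
#include <cstdio>
#include <string>

// Hazard: if line.at(pos) is a char with the high bit set (e.g. '\xE2')
// and char is signed, std::isspace() receives a negative int that is
// neither EOF nor representable as unsigned char: undefined behavior.
bool is_space_bad(const std::string& line, std::size_t pos) {
  return std::isspace(line.at(pos)) != 0;
}

// Fix: convert to unsigned char first, so the argument is always in [0, 255].
bool is_space_ok(const std::string& line, std::size_t pos) {
  return std::isspace(static_cast<unsigned char>(line.at(pos))) != 0;
}

int main() {
  std::string line = "\xE2\xA0\x80"; // UTF-8 bytes of U+2800
  std::printf("%d\n", is_space_ok(line, 0) ? 1 : 0); // prints 0 in the "C" locale
  return 0;
}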
Even after many years of programming in C, C++, and many other languages, some complacency sets in with such completely counter-intuitive constructs in C and C++.
I 100% understand :) This is very unfortunate historical baggage in the C & C++ standards.
Fixed in 4.1.2, along with a few other improvements based on the ugrep issue discussions at https://github.com/Genivia/ugrep/issues/339