Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
504 stars 85 forks source link

Build error in examples/braille #198

Closed mdehling closed 5 months ago

mdehling commented 5 months ago

Trying to build the braille example using clang 16.0.6 or gcc 10.4.0 leads to the following errors:

/pkg/bin/reflex  braille.l
braille.l:277: warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?
braille.l:277: warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?
braille.l:277: warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?
/pkg/bin/clang++ -O2 -I. -I/pkg/include -Wall -Wunused -Wextra -fpermissive -o braille lex.yy.cpp /pkg/lib/libreflex.a
braille.l:63:44: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
  static const char *REGEX_INITIAL = "(?m)(<A0>)|(<A0>)|((?:⠀))|([\\x09\\x0a\\x0d\\x20])|(\\xe2(?:[\\xa0-\\xa3][\\x80-\\xbf]))|((?:.[\\x80-\\xbf]*))";
                                           ^~~~   ~~~~
braille.l:65:44: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
  static const char *REGEX_letters = "(?m)(<A0>)|(<A0>)|((?:⠀))|([\\x09\\x0a\\x0d\\x20])|(\\xe2(?:[\\xa0-\\xa3][\\x80-\\xbf]))|((?:.[\\x80-\\xbf]*))";
                                           ^~~~   ~~~~
braille.l:67:44: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
  static const char *REGEX_numbers = "(?m)(<A0>)|(<A0>)|((?:⠀))|([\\x09\\x0a\\x0d\\x20])|(\\xe2(?:[\\xa0-\\xa3][\\x80-\\xbf]))|((?:.[\\x80-\\xbf]*))";
                                           ^~~~   ~~~~
braille.l:106:1: error: source file is not valid UTF-8
<A0><81>               emit("a"); start(letters);
^
braille.l:106:2: error: source file is not valid UTF-8
<A0><81>               emit("a"); start(letters);
    ^
braille.l:107:1: error: unexpected character <U+2803>
⠃               emit("b"); start(letters);
^~
braille.l:108:1: error: unexpected character <U+2809>
⠉               emit("c"); start(letters);
^~
braille.l:109:1: error: unexpected character <U+2819>
⠙               emit("d"); start(letters);
^~
braille.l:110:1: error: unexpected character <U+2811>
⠑               emit("e"); start(letters);
^~
[...]
/pkg/bin/reflex  braille.l
braille.l:277: warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?
braille.l:277: warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?
braille.l:277: warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?
/pkg/gcc10/bin/g++ -O2 -I. -I/pkg/include -Wall -Wunused -Wextra -fpermissive -o braille lex.yy.cpp /pkg/lib/libreflex.a
braille.l:106:1: error: stray '\240' in program
  106 | // INITIAL ABBREVIATIONS
      | ^
braille.l:106:2: error: stray '\201' in program
  106 | // INITIAL ABBREVIATIONS
      |  ^
braille.l:107:1: error: extended character ⠃ is not valid in an identifier
  107 | ⠐⠓              emit("here"); start(letters);
      | ^
braille.l:108:1: error: extended character ⠉ is not valid in an identifier
  108 | ⠐⠮              emit("there"); start(letters);
      | ^
braille.l:109:1: error: extended character ⠙ is not valid in an identifier
  109 | ⠐⠱              emit("where"); start(letters);
      | ^
braille.l:110:1: error: extended character ⠑ is not valid in an identifier
  110 | ⠐⠑              emit("ever"); start(letters);
      | ^
braille.l:111:1: error: extended character ⠋ is not valid in an identifier
  111 | ⠐⠳              emit("ought"); start(letters);
      | ^
braille.l:112:1: error: extended character ⠛ is not valid in an identifier
  112 | ⠐⠋              emit("father"); start(letters);
      | ^
braille.l:113:1: error: extended character ⠓ is not valid in an identifier
  113 | ⠐⠍              emit("mother"); start(letters);
      | ^
braille.l:114:1: error: extended character ⠊ is not valid in an identifier
  114 | ⠐⠝              emit("name"); start(letters);
      | ^
[...]

Am I doing something wrong? I am not very knowledgeable when it comes to character encodings.

Build system is NetBSD 10 (beta).

mdehling commented 5 months ago

Updates:

genivia-inc commented 5 months ago

Thank you for your feedback.

This looks indeed very wrong. I've never seen this before on any platform and compiler tested. Something breaks to parse and convert regular expressions to code. The subsumption warnings are wrong and indicate something is broken in regex parsing and conversion.

Reflex is not dependent on the compiler nor on the OS nor on 32 bit versus 64 bit, which were tested. I cannot explain why this happens without tracing the execution (see further below).

"braille.l:63:44: warning:" this line has UTF-8 character encodings like all the other:

⠉               emit("c"); start(letters);

and the correct generated code is the following with valid UTF-8 in the literal string:

  static const char *REGEX_INITIAL = "(?m)(⠰)|(⠁)|(⠃)...

whereas in the error you get, it shows as:

static const char *REGEX_INITIAL = "(?m)(<A0>)|(<A0>)|((?:⠀))

I don't have Clang 16 (clang 14). I wish there is a Docker container available or can be created to replicate the problem, because I don't want to mess with my development environments.

To trace all reflex for debugging, change lib/Make to enable this line:

CMFLAGS=-DDEBUG

then execute make -f Make clean and make -f Make in the lib directory. Do the same in the src directory by changing src/Make to build the reflex tool.

Running the reflex command generates a large DEBUG.log file. This is useful to trace the execution and to spot differences right away e.g. by comparing with diff.

mdehling commented 5 months ago

Thanks for the quick response and happy pi day :)

Adding -DDEBUG to CXXFLAGS in lib/ leads to the following error:

pattern.cpp: In member function ‘void reflex::Pattern::gen_match_hfa_start(reflex::Pattern::DFA::State*, reflex::Pattern::HFA::State&, reflex::Pattern::HFA::StateHashes&)’:
pattern.cpp:4417:40: error: ‘state’ was not declared in this scope
 4417 |       DBGLOG("0 HFA %p: %u..%u -> %p", state, lo, hi, next_state);
      |                                        ^~~~~

The log statement in this function was introduced in v4.0.0 and at the same time the state parameter was renamed start. After replacing state with start in the DBGLOG() statement the build completes.

I've attached the DEBUG.log files for a run of reflex braille.l on NetBSD and Ubuntu. The diff -u output for logs with timestamps trimmed looks like this:

--- DEBUG-Ubuntu-trimmed.log    2024-03-14 16:21:37.863987366 -0700
+++ DEBUG-NetBSD-trimmed.log    2024-03-14 16:21:22.395122036 -0700
@@ -9,12 +9,18 @@
 pattern.cpp:1149 BEGIN parse4(5)
 pattern.cpp:1402 END parse4()
 pattern.cpp:1133 END parse3()
-pattern.cpp:988  BEGIN parse3(6)
-pattern.cpp:1149 BEGIN parse4(6)
+pattern.cpp:972  END parse2()
+pattern.cpp:851  END parse1()
 pattern.cpp:1402 END parse4()
 pattern.cpp:1133 END parse3()
-pattern.cpp:988  BEGIN parse3(7)
-pattern.cpp:1149 BEGIN parse4(7)
+pattern.cpp:972  END parse2()
+pattern.cpp:867  BEGIN parse2(8)
+pattern.cpp:988  BEGIN parse3(8)
+pattern.cpp:1149 BEGIN parse4(8)
+pattern.cpp:810  BEGIN parse1(9)
+pattern.cpp:867  BEGIN parse2(9)
+pattern.cpp:988  BEGIN parse3(9)
+pattern.cpp:1149 BEGIN parse4(9)
 pattern.cpp:1402 END parse4()
 pattern.cpp:1133 END parse3()
 pattern.cpp:972  END parse2()
[...]

Does this tell you anything?

DEBUG-NetBSD.log.gz DEBUG-Ubuntu.log.gz

genivia-inc commented 5 months ago

The diff shows that the regex pattern parsed from the braille.l file is different somehow, or the regex converter in lib/convert.cpp behaves differently to convert the regex pattern. I can see this from the parseN() calls that show this difference at a | (alternation) which aligns with the pattern errors:

"(?m)(<A0>)|(<A0>)|((?:⠀))

versus

"(?m)(⠰)|(⠁)|(⠃)

which has UTF-8 multibytes, not a single <A0> character that is not valid UTF-8.

The problem is either somewhere when the patterns are taken from the braille.l file or in the next stage, the converter that analysis the regex pattern strings to convert when needed, e.g. to convert Unicode regex to produce a regex that works with any 8-bit regex engine. So it may not be the pattern-to-DFA construction in pattern.cpp, which is all this DEBUG.log code.

A <A0> in the regex pattern makes no sense to see here. The Braille code blocks starts at U+2800 with multibyte E2 A0 80. But there is a single <A0> when the error occurs, is there? Or is there a part of a UTF-8 multibyte sequence? The generated lexer code has a very long regex pattern string that has these bytes. I wonder if it is and then chopped off?

What if you truncate the braille.l definitions to just a few patterns? Like so:

%%

<*>⠰            start(INITIAL);

<INITIAL,letters>{
⠁               emit("a"); start(letters);
}

<INITIAL>{
// CONTRACTIONS
⠃/{la}          emit("but");
}

<letters>{
// DIGRAPHS
⠂               emit("ea");
}

<*>⠼            start(numbers);

<numbers>{
⠁               emit("0");
}

<*>{
// PUNCTUATION
⠂/{la}          emit(",");
// SPACING
\u{2800}        emit(" "); start(INITIAL);
[ \t\n\r]       emit(text()); start(INITIAL);
// OTHER BRAILLE LIGATURES
\p{Braille}     emit(text());
}

<*>.            std::cerr << "Error in input at " << lineno() << std::endl;

%%

That should make it easier to see what the generated string looks like with the <A0>.

This is a very strange issue. I've avoided undefined behavior and things like char being unsigned or signed that differ with compilers/OS.

mdehling commented 5 months ago

Stepping through Reflex::get_pattern() in gdb led to the underlying issue:

On my NetBSD system, std::isspace() returns true for '\xE2' and '\xB0' and false for '\xA0'. On my Ubuntu test machine it returns false for all three characters.

As a result the pattern recognized on my NetBSD system is just the middle '\xA0' of the three byte UTF-8 sequence.

I'm not sure if this is a NetBSD libc bug or some configuration issue (the standard allows for locale-specific whitespace characters.) Either way, I'm sorry for the noise. It seems like this is not a RE-flex issue.

mdehling commented 5 months ago

Or this may be caused by passing a char to isspace() without casting to unsigned char first.

genivia-inc commented 5 months ago

The Reflex::get_pattern() uses several isspace() which are all applied to the char value of line.at(pos) where Reflex::line is a std::string.

The std::string::at() method returns a char reference. Whatever the sign of a char is on this platform should match with the signedness that std::isspace() expects as an int. We should not have to cast. Something is fundamentally wrong with this 🤔

mdehling commented 5 months ago

I think you are mistaken: see, e.g., https://en.cppreference.com/w/cpp/string/byte/isspace .

std::isspace expects the int argument to be in the range of an unsigned char, or equal to EOF. Anything else is undefined behavior.

(And while preparing a patch for this, I noticed that you do indeed handle this correctly in other places!)

genivia-inc commented 5 months ago

Even after many years of programming C and C++ and many other PL, some complacency sets in with such completely counter-intuitive constructs in C and C++.

mdehling commented 5 months ago

I 100% understand :) This is very unfortunate historical baggage in the C & C++ standards.

genivia-inc commented 5 months ago

Fixed in 4.1.2 along with a few other improvements based on ugrep issue discussions at https://github.com/Genivia/ugrep/issues/339