Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

How to build a lexer that supports Unicode? #184

Closed haihuayang closed 1 year ago

haihuayang commented 1 year ago

Hello,

I made the following change to build wc with Unicode support, but when I tried some UTF-8 input, wc did not report the correct number of characters. Instead it just reports the number of bytes, the same as without the '--unicode' option. Is there anything I missed?

Thanks,

$ git diff
diff --git a/examples/Make b/examples/Make
index 4080e27c..474e1049 100644
--- a/examples/Make
+++ b/examples/Make
@@ -326,7 +326,7 @@ calc:               calc.l calc.y
                ./calc < calc.test

 wc:            wc.l
-               $(REFLEX) $(REFLAGS) --flex wc.l
+               $(REFLEX) $(REFLAGS) --flex --unicode wc.l
                $(CXX) $(CXXFLAGS) -o $@ lex.yy.cpp $(LIBREFLEX)

 wcu:           wcu.l
$ echo 好 | ./wc
       1       1       4

genivia-inc commented 1 year ago

Please read the documentation on how to use RE/flex if you want to make changes.

For backward compatibility with Flex, the yyleng value always counts bytes, not characters, and so does the size() method. Use the wsize() method to count wide characters, i.e. Unicode characters.
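
To illustrate the byte/character distinction behind the output above, here is a minimal standalone C++ sketch. It does not use RE/flex and is not how RE/flex implements wsize(); the helper name utf8_length is hypothetical, and the input is assumed to be valid UTF-8. It counts code points by skipping UTF-8 continuation bytes, which is why "好" followed by a newline is 4 bytes but only 2 characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of RE/flex): count Unicode code points in a
// UTF-8 encoded string. Every code point has exactly one leading byte; the
// remaining bytes are continuation bytes of the form 10xxxxxx, so counting
// the bytes that are NOT continuation bytes gives the character count.
// Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s)
{
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xC0) != 0x80)  // skip continuation bytes
      ++n;
  return n;
}
```

For the input produced by `echo 好`, std::string::size() gives the byte count that wc reported, while utf8_length gives the character count the question expected. In a RE/flex action, per the reply above, wsize() would take the place of yyleng for the same purpose.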