Open aslakhellesoy opened 11 years ago
Useful SO thread: http://stackoverflow.com/questions/9611682/flexlexer-support-for-unicode
I have verified that with Ragel, multi-byte characters (such as å) work fine for recognition, but it puts the `firstColumn` and `lastColumn` values off, since they are based on `ts` and `te`, which seem to be counting bytes, not characters. This is not a huge problem since we're only likely to be using line numbers in error reporting anyway.
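If we ever do want accurate columns, one option is to convert the byte offsets after the fact. The following is only a rough sketch in Python, not code from this project: it assumes `ts` and `te` are 0-based byte offsets into a UTF-8 encoded line, and the helper name `byte_offset_to_column` is made up for illustration.

```python
def byte_offset_to_column(line_bytes: bytes, byte_offset: int) -> int:
    """Convert a 0-based byte offset into a 1-based character column.

    Assumes line_bytes is valid UTF-8 and byte_offset falls on a
    character boundary (hypothetical helper, not part of the lexer).
    """
    # Decode only the prefix before the offset and count its characters.
    prefix = line_bytes[:byte_offset].decode("utf-8")
    return len(prefix) + 1


line = "på && x".encode("utf-8")

# "å" occupies two bytes in UTF-8, so "&&" starts at byte offset 4
# even though it is only the fourth character on the line.
ts = line.index(b"&&")
naive_column = ts + 1                            # byte-based column
real_column = byte_offset_to_column(line, ts)    # character-based column
```

Here `naive_column` is 5 while `real_column` is 4, which is the kind of drift described above. Decoding the prefix for every token is linear in the line length, so this would only be worth doing when a column is actually reported.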
If Gherkin3 is going to use this project as a template, we have to make sure we can scan UTF-8 encoded input, since many Gherkin translations rely on the Unicode character set.
A simple way to do this is to create a `utf8` branch where we change `&&` (AND) to `øø` everywhere, both in lexer definitions and in tests. If everything passes we're fine, if not we have a problem....