Prove that we can use UTF-8

cucumber-attic / bool

Cross-platform boolean expression parser and interpreter

MIT License

Prove that we can use UTF-8 #38

Open aslakhellesoy opened 11 years ago

aslakhellesoy commented 11 years ago

If Gherkin3 is going to use this project as a template, we have to make sure we can scan UTF-8 encoded input, since many Gherkin translations rely on the Unicode character set.

A simple way to do this is to create a utf8 branch where we change && (AND) to øø everywhere, both in the lexer definitions and in the tests. If everything passes, we're fine; if not, we have a problem.
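
To make the idea concrete, here is a minimal sketch of what a byte-oriented scanner sees when `øø` replaces `&&`. It is not this project's lexer: `tokenize`, the `VAR(...)` token format, and the `||` operator are hypothetical names chosen for illustration. The point is only that `øø` is four bytes in UTF-8, and that matching on the encoded bytes is enough to recognise it.

```python
# Hypothetical byte-oriented tokenizer, for illustration only.
# It is NOT the lexer used by this project.

AND = "øø".encode("utf-8")  # b'\xc3\xb8\xc3\xb8': four bytes, two characters
OR = b"||"                  # assumed second operator, for contrast


def _at_operator(data: bytes, pos: int) -> bool:
    """True if an operator starts at byte position pos."""
    return data.startswith(AND, pos) or data.startswith(OR, pos)


def tokenize(source: str) -> list[str]:
    """Scan the UTF-8 bytes of source and return a list of token names."""
    data = source.encode("utf-8")
    tokens = []
    i = 0
    while i < len(data):
        if data[i:i + 1].isspace():
            i += 1
        elif data.startswith(AND, i):
            tokens.append("AND")
            i += len(AND)
        elif data.startswith(OR, i):
            tokens.append("OR")
            i += len(OR)
        else:
            # Consume bytes up to the next whitespace or operator.
            j = i
            while (j < len(data)
                   and not data[j:j + 1].isspace()
                   and not _at_operator(data, j)):
                j += 1
            tokens.append("VAR(" + data[i:j].decode("utf-8") + ")")
            i = j
    return tokens


print(tokenize("blå øø rød"))
```

Because UTF-8 lead bytes and continuation bytes never overlap, the four-byte operator can only match at a character boundary, so variable names containing `ø` or `å` are not split in the middle of a character.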

aslakhellesoy commented 11 years ago

Useful SO thread: http://stackoverflow.com/questions/9611682/flexlexer-support-for-unicode

aslakhellesoy commented 11 years ago

I have verified that with Ragel, multi-byte characters (such as å) work fine for recognition, but they throw the firstColumn and lastColumn values off, since those are based on ts and te, which seem to count bytes, not characters. This is not a huge problem, since we're only likely to use line numbers in error reporting anyway.
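
If character-based columns are ever needed, one possible fix is to convert the byte offset to a character count after scanning. The sketch below assumes the scanner reports a 0-based byte offset within the line and that the offset falls on a character boundary; `char_column` is a hypothetical helper, not part of this project or of Ragel.

```python
def char_column(line: bytes, byte_offset: int) -> int:
    """Return the 1-based character column for a 0-based byte offset.

    Assumes `line` is valid UTF-8 and `byte_offset` lies on a
    character boundary (true for the start or end of a matched token).
    """
    return len(line[:byte_offset].decode("utf-8")) + 1


line = "å øø b".encode("utf-8")
byte_offset = line.index("øø".encode("utf-8"))  # å is 2 bytes, space is 1

print(byte_offset + 1)                 # byte-based column
print(char_column(line, byte_offset))  # character-based column
```

Here the byte-based column of `øø` is 4 while the character-based column is 3. Note that this counts Unicode code points, so combining sequences would still be counted as more than one column.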