adrian-thurston / ragel

Ragel State Machine Compiler
MIT License

[feature request] Basic utf-8 support in rule language #4

Open jbulow opened 4 years ago

jbulow commented 4 years ago

When writing rules for a W3C standard such as SPARQL, using the notation specified in the XML standard, it would be quite convenient to have some support for converting Unicode code points into machines that match the UTF-8 encoding of those code points.

E.g. (from the SPARQL specification):

#x00C0 converts to the state machine 0xC3 0x80

[#x00C0-#x00D6] converts to the state machine 0xC3 0x80..0x96

[#x037F-#x1FFF] converts to the state machine 0xCD 0xBF | 0xCE..0xDF 0x80..0xBF | 0xE0 0xA0..0xBF 0x80..0xBF | 0xE1 0x80..0xBF 0x80..0xBF

(I have not yet verified the examples above, but I hope the idea is clear)

The syntax \u could be used as an alternative to the syntax #x for code points.
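
To make the request concrete, here is a minimal Python sketch of the conversion, not anything Ragel provides today. It splits a code point range into alternatives, each a sequence of byte ranges, using the common approach of cutting first at UTF-8 length boundaries and then wherever the continuation bytes would not cover a full 0x80..0xBF span. The names `utf8_ranges` and `format_alternatives` are hypothetical, the surrogate block is skipped by assumption, and the output only mirrors the notation used in the examples above.

```python
def utf8_ranges(lo, hi):
    """Split the code point range [lo, hi] into UTF-8 byte-range sequences.

    Returns a list of alternatives; each alternative is a list of
    (min_byte, max_byte) pairs, one pair per encoded byte.
    """
    if lo > hi:
        return []
    # Assumption: surrogates (U+D800..U+DFFF) are not encodable, so skip them.
    if lo <= 0xDFFF and hi >= 0xD800:
        return utf8_ranges(lo, 0xD7FF) + utf8_ranges(0xE000, hi)
    # Cut at the boundaries where the encoded length changes (1/2/3/4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            return utf8_ranges(lo, limit) + utf8_ranges(limit + 1, hi)
    # A pure ASCII range is a single one-byte range.
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Cut until every trailing byte position is either fixed or spans
    # the full continuation range, so the bytes can be paired up directly.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask != 0:
                return utf8_ranges(lo, lo | mask) + utf8_ranges((lo | mask) + 1, hi)
            if hi & mask != mask:
                return utf8_ranges(lo, (hi & ~mask) - 1) + utf8_ranges(hi & ~mask, hi)
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    return [list(zip(first, last))]


def format_alternatives(lo, hi):
    """Render the alternatives in the notation used in this issue."""
    def byte_range(a, b):
        return f"0x{a:02X}" if a == b else f"0x{a:02X}..0x{b:02X}"

    return " | ".join(
        " ".join(byte_range(a, b) for a, b in seq)
        for seq in utf8_ranges(lo, hi)
    )


print(format_alternatives(0x00C0, 0x00C0))  # 0xC3 0x80
print(format_alternatives(0x00C0, 0x00D6))  # 0xC3 0x80..0x96
print(format_alternatives(0x037F, 0x1FFF))
```

For [#x037F-#x1FFF] the three-byte part comes out as two alternatives, because a lead byte of 0xE0 may only be followed by 0xA0..0xBF; the bytes 0x80..0x9F after 0xE0 would be overlong encodings.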

jbulow commented 4 years ago

Some related information: https://www.w3.org/2005/03/23-lex-U