apblack closed this issue 6 years ago
SmaCC is set up to restrict itself to 8-bit characters; compiling it to accept Unicode is a bit involved (I'm pushing a test case where → is clear). I will add a %unicode grammar option to turn that on (there should be one).
In the future, I'm tempted to have SmaCC scanners work in UTF-8 mode, one byte at a time. I have a use case that would require that, and it would make everything simpler (and probably significantly faster as well, as we've discovered recently here).
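As a rough illustration of the byte-at-a-time idea, here is a small Python sketch (not SmaCC code; the function name `match_utf8_literal` is made up for this example). A literal such as → becomes a fixed three-byte sequence, so a scanner can match it using only 8-bit transitions:

```python
def match_utf8_literal(literal: str, data: bytes, start: int = 0) -> int:
    """Return the index just past `literal` if its UTF-8 bytes occur at `start`, else -1."""
    pos = start
    for expected in literal.encode("utf-8"):
        # Each step compares exactly one byte, as an 8-bit scanner would.
        if pos >= len(data) or data[pos] != expected:
            return -1
        pos += 1
    return pos


arrow_bytes = "→".encode("utf-8")   # three bytes: E2 86 92
source = "a → b".encode("utf-8")
end = match_utf8_literal("→", source, 2)  # the arrow starts at byte offset 2
```

With this approach the scanner tables never need an entry wider than one byte, whatever the code point of the character in the grammar.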
I can enlarge those \x expressions as well.
The SmaCC booklet speaks of checking the "Allow Unicode Characters" option, but I don't see any options for SmaCC. Is this obsolete, or am I not looking hard enough?
That option was on an older version of the GUI. I'll try to re-add it.
I think that making it a directive in the file makes more sense than a GUI button. That way, everything that one needs to compile a grammar is in one place.
Yes.
I just hope that it doesn't trigger any strange things until the option is seen by the parser. The unicode enabler is a SmaCCGrammar class variable, and it seems tricky to change it (it changes it globally and permanently).
I've pushed a new version with that %unicode; option, limited for now to 16rFFFF. It should handle that → character.
I'll keep this issue open, because we need a redesign of that part. When dealing with unicode, one should never try to enumerate all characters.
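One way to avoid enumeration, sketched here in Python rather than in SmaCC's own code (the `CharRanges` class is hypothetical), is to store a character class as sorted code point intervals and test membership with a binary search:

```python
from bisect import bisect_right


class CharRanges:
    """A character class stored as sorted, non-overlapping (low, high) code point intervals."""

    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        self.lows = [low for low, _ in self.ranges]

    def __contains__(self, char):
        code = ord(char)
        # Find the last interval whose lower bound is <= code.
        i = bisect_right(self.lows, code) - 1
        return i >= 0 and code <= self.ranges[i][1]


# Everything outside ASCII, up to the last Unicode code point, in a single interval.
non_ascii = CharRanges([(0x80, 0x10FFFF)])
```

Here the class covering more than a million code points costs one interval, and raising the upper limit beyond 16rFFFF does not change the size of the representation.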
OK, the %unicode handling code works now, so I can close this issue (and keep another one open for the \xFFFF limitation).
SmaCC seems to have trouble with unicode characters. The following definition produces "Error: collection is empty" when compiling the grammar.
It doesn't matter if the right-arrow is represented as → or \x2192: the same error occurs during compilation.
If the → is the only alternative specified for <arrow>, i.e.
then there is no compilation error, but the generated parser rejects an input consisting of →.
I'm also not clear if Unicode characters with code points greater than \xFFFF can be written in the grammar. The SmaCC book says that a maximum of four characters are accepted after the \x. This would make it impossible to mention useful Unicode characters such as 📅 or 💩, which should be part of any modern language 😉.
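For reference, the code points involved can be checked quickly in Python. This only inspects the characters themselves; it says nothing about what SmaCC's \x syntax accepts, and `hex_digits_needed` is a helper written for this example:

```python
def hex_digits_needed(char: str) -> int:
    """Number of hex digits needed to write the character's code point."""
    return len(format(ord(char), "X"))


for char in "→📅💩":
    print(f"U+{ord(char):04X} needs {hex_digits_needed(char)} hex digits")
```

The arrow fits in four hex digits, while the two emoji need five, which is why a four-digit limit after \x cannot express them.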