Unicode characters are not handled correctly

SmaCCRefactoring / SmaCC

Smalltalk Compiler Compiler : a parser generator


Unicode characters are not handled correctly #29

Closed apblack closed 6 years ago

apblack commented 7 years ago

SmaCC seems to have trouble with unicode characters. The following definition produces "Error: collection is empty" when compiling the grammar.

<arrow>:  ->|<-|→;

Program
    : <arrow>
    ;

It doesn't matter if the right-arrow is represented as → or \x2192: the same error occurs during compilation.

If the → is the only alternative specified for <arrow>, i.e.

<arrow>:  →;

then there is no compilation error, but the generated parser rejects an input consisting of →.

I'm also not clear whether Unicode characters with code points greater than \xFFFF can be written in the grammar. The SmaCC book says that a maximum of four characters are accepted after the \x. This would make it impossible to mention useful Unicode characters such as 📅 or 💩, which should be part of any modern language 😉.
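To make the four-digit limitation concrete, here is a small Python sketch (not SmaCC code; the helper name `hex_digits_needed` is made up for illustration) that counts how many hex digits each character's code point needs. The arrow fits in four digits, while the two emoji do not.

```python
def hex_digits_needed(ch: str) -> int:
    """Number of hex digits needed to write the code point of a single character."""
    return len(format(ord(ch), "X"))

# Report each character from the comment above and whether a
# four-digit \xNNNN escape could express it.
for ch in ["→", "📅", "💩"]:
    code_point = ord(ch)
    fits = code_point <= 0xFFFF
    print(f"U+{code_point:04X}: {hex_digits_needed(ch)} hex digits, fits in \\xFFFF: {fits}")
```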

ThierryGoubier commented 7 years ago

SmaCC is set up to restrict itself to 8-bit characters; compiling it to accept Unicode is a bit involved (I'm pushing a test case where → is clear). I will add a grammar %unicode option to turn that on (there should be one).

In the future, I'm tempted to have SmaCC scanners work in UTF8 mode, one byte at a time. I have a use case that would require that, and that would make everything simpler (and probably significantly faster as well, as we've discovered recently here).

I can enlarge those \x expressions as well.
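The byte-at-a-time UTF-8 idea can be sketched roughly as follows. This is an illustrative Python model only, not how SmaCC scanners are implemented: the functions `build_byte_trie` and `scan_token` are hypothetical names, and the token set is the `<arrow>` definition from the report. Because every transition is on a single byte (0 to 255), → is simply a three-byte sequence and needs no special Unicode handling in the scanner tables.

```python
def build_byte_trie(tokens):
    """Build a trie keyed on UTF-8 bytes; the key None marks an accepting node."""
    trie = {}
    for tok in tokens:
        node = trie
        for byte in tok.encode("utf-8"):
            node = node.setdefault(byte, {})
        node[None] = tok
    return trie

def scan_token(trie, data: bytes, pos: int = 0):
    """Consume one byte at a time; return (token, end_index) for the longest match, or None."""
    node = trie
    last_match = None
    i = pos
    while True:
        if None in node:
            last_match = (node[None], i)
        if i >= len(data) or data[i] not in node:
            return last_match
        node = node[data[i]]
        i += 1

# The three alternatives of <arrow> from the grammar in the report.
arrow_trie = build_byte_trie(["->", "<-", "→"])
```

For example, `scan_token(arrow_trie, "→".encode("utf-8"))` walks the bytes `E2 86 92` and reports a match ending at index 3.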

apblack commented 7 years ago

The SmaCC booklet speaks of checking the "Allow Unicode Characters" option, but I don't see any options for SmaCC. Is this obsolete, or am I not looking hard enough?

ThierryGoubier commented 7 years ago

That option was on an older version of the GUI. I'll try to re-add it.

apblack commented 7 years ago

I think that making it a directive in the file makes more sense than a GUI button. That way, everything that one needs to compile a grammar is in one place.

ThierryGoubier commented 7 years ago

Yes.

I just hope that it doesn't trigger any strange things until the option is seen by the parser. The unicode enabler is a SmaCCGrammar class variable, and it seems tricky to change it (it changes it globally and permanently).

ThierryGoubier commented 7 years ago

I've pushed a new version with that %unicode; option, limited for now to 16rFFFF. It should handle that → character.

I'll keep this issue open, because we need a redesign of that part. When dealing with unicode, one should never try to enumerate all characters.
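One common way to avoid enumerating characters is to store a character class as a short list of inclusive code point ranges. The Python sketch below is only an illustration of that idea, under the assumption that ranges are sorted and non-overlapping; it is not the redesign that was eventually done in SmaCC, and the names `complement` and `contains` are hypothetical. Note that "every character except →" takes two ranges rather than over a million entries.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

def complement(ranges):
    """Complement of sorted, non-overlapping inclusive (lo, hi) ranges over all of Unicode."""
    result = []
    next_lo = 0
    for lo, hi in ranges:
        if lo > next_lo:
            result.append((next_lo, lo - 1))
        next_lo = hi + 1
    if next_lo <= MAX_CODE_POINT:
        result.append((next_lo, MAX_CODE_POINT))
    return result

def contains(ranges, ch: str) -> bool:
    """True if the character's code point falls inside any of the ranges."""
    code_point = ord(ch)
    return any(lo <= code_point <= hi for lo, hi in ranges)

# "Anything but →" expressed as two ranges instead of an enumeration.
not_arrow = complement([(0x2192, 0x2192)])
```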

ThierryGoubier commented 6 years ago

OK, the %unicode handling code works now, so I can close this issue (and keep another one on the \xFFFF limitation).