Closed CeylonMigrationBot closed 8 years ago
[@lucaswerkmeister] Wow, it seems like ANTLR doesn’t have an easy solution to that – this Java grammar (they serve the file as HTML, view source to see the line breaks) contains almost 700 lines of character ranges for IdentifierStart
and IdentifierPart
:D
[@FroMage] Does seem like a bug, rather than a spec oversight.
[@lucaswerkmeister] You think that the spec should allow this!?
class “This is one identifier – crazy,isn’t it?”(){}
[@FroMage] No, I'm saying the spec is right and it's a grammar bug ;)
[@lucaswerkmeister] Ah, I misread the order of “rather than” :) sorry!
[@gavinking] In fact this stuff is supposed to be handled by the following grammar rule:
fragment
IdentifierPart
: '_'
| Digit
| Letter
{ if (!Character.isJavaIdentifierPart($text.codePointAt(0))) {
//TODO: error!
} }
;
But oops: look at that TODO!
[@gavinking] Fixed, I think. The only question I have is: do all Unicode letters and digits lie within the ranges 'a'..'z'
, 'A'..'Z'
, '0'..'1'
, and '\u00c0'..'\ufffe'
?
[@lucaswerkmeister] Well, there’s still the category “Letter, Modifier”. If you allowed that, checking for validity would be easier.
Apart from that… I don’t know how the surrogate characters work in ANTLR. I tried to trip it up with the LINEAR B IDEOGRAM B100 MAN ‘𐂀’, which is a “Letter, Other” with code U+10080 (second 16 bytes are not in range c0-fffe), but that’s still recognized as an identifier.
[@gavinking] Answering my own question: no, it looks like the correct range would be '\u0080'..'\u2FA1F'
.
[@gavinking] OK, I believe this is fixed now.
[@lucaswerkmeister] The spec claims that
However, this is not what the lexer actually does:
These are not equivalent: For example, the ʰ has the Unicode category lm (letter, modifier), which the spec doesn’t mention – but as the character is U+02B0 ≤ U+FFFE, the grammar accepts it. Here I would suggest that any Unicode letter should be allowed as
IdentifierPart
. On the other hand, the parser also accepts the Narrow No-Break Space ‘ ’ (U+202F) as part of an identifier, which is of course wrong. (I also tried to find an example where the lexer would fail to recognize a letter from the Supplementary Multilingual Plane, but it didn’t work, probably because of surrogate pair stuff.)Code example (copy+paste it into your IDE):
[Migrated from ceylon/ceylon-spec#960] [Closed at 2014-05-09 19:48:24]