eclipse-archived / ceylon

The Ceylon compiler, language module, and command line tools
http://ceylon-lang.org
Apache License 2.0
399 stars 62 forks source link

What makes up an identifier? Grammar vs Spec (Unicode fun) #4066

Closed CeylonMigrationBot closed 8 years ago

CeylonMigrationBot commented 10 years ago

[@lucaswerkmeister] The spec claims that

The lexer classifies Unicode uppercase letters, lowercase letters, and numeric characters depending on the general category of the character as defined by the Unicode standard. A LowercaseLetter is any character whose general category is Ll. An UppercaseLetter is any character whose general category is Lu, Lt, or Lo. A Number is any character whose general category is Nd, Nl, or No.

However, this is not what the lexer actually does:

fragment
Letter
    : 'a'..'z'
    | 'A'..'Z'
    | '\u00c0'..'\ufffe'
    ;

These are not equivalent: For example, the ʰ has the Unicode category lm (letter, modifier), which the spec doesn’t mention – but as the character is U+02B0 ≤ U+FFFE, the grammar accepts it. Here I would suggest that any Unicode letter should be allowed as IdentifierPart. On the other hand, the parser also accepts the Narrow No-Break Space ‘ ’ (U+202F) as part of an identifier, which is of course wrong. (I also tried to find an example where the lexer would fail to recognize a letter from the Supplementary Multilingual Plane, but it didn’t work, probably because of surrogate pair stuff.)

Code example (copy+paste it into your IDE):

class “This is one identifier – crazy,isn’t it?”(){}

[Migrated from ceylon/ceylon-spec#960] [Closed at 2014-05-09 19:48:24]

CeylonMigrationBot commented 10 years ago

[@lucaswerkmeister] Wow, it seems like ANTLR doesn’t have an easy solution to that – this Java grammar (they serve the file as HTML, view source to see the line breaks) contains almost 700 lines of character ranges for IdentifierStart and IdentifierPart :D

CeylonMigrationBot commented 10 years ago

[@FroMage] Does seem like a bug, rather than a spec oversight.

CeylonMigrationBot commented 10 years ago

[@lucaswerkmeister] You think that the spec should allow this!?

class “This is one identifier – crazy,isn’t it?”(){}
CeylonMigrationBot commented 10 years ago

[@FroMage] No, I'm saying the spec is right and it's a grammar bug ;)

CeylonMigrationBot commented 10 years ago

[@lucaswerkmeister] Ah, I misread the order of “rather than” :) sorry!

CeylonMigrationBot commented 10 years ago

[@gavinking] In fact this stuff is supposed to be handled by the following grammar rule:

fragment 
IdentifierPart
    :   '_'
    |   Digit
    |   Letter
        { if (!Character.isJavaIdentifierPart($text.codePointAt(0))) { 
          //TODO: error!
        } }
    ;

But oops: look at that TODO!

CeylonMigrationBot commented 10 years ago

[@gavinking] Fixed, I think. The only question I have is: do all Unicode letters and digits lie within the ranges 'a'..'z', 'A'..'Z', '0'..'1', and '\u00c0'..'\ufffe'?

CeylonMigrationBot commented 10 years ago

[@lucaswerkmeister] Well, there’s still the category “Letter, Modifier”. If you allowed that, checking for validity would be easier.

Apart from that… I don’t know how the surrogate characters work in ANTLR. I tried to trip it up with the LINEAR B IDEOGRAM B100 MAN ‘𐂀’, which is a “Letter, Other” with code U+10080 (second 16 bytes are not in range c0-fffe), but that’s still recognized as an identifier.

CeylonMigrationBot commented 10 years ago

[@gavinking] Answering my own question: no, it looks like the correct range would be '\u0080'..'\u2FA1F'.

CeylonMigrationBot commented 10 years ago

[@gavinking] OK, I believe this is fixed now.