ceylon / ceylon-spec

DEPRECATED
Apache License 2.0
108 stars 34 forks source link

What makes up an identifier? Grammar vs Spec (Unicode fun) #960

Closed lucaswerkmeister closed 10 years ago

lucaswerkmeister commented 10 years ago

The spec claims that

The lexer classifies Unicode uppercase letters, lowercase letters, and numeric characters depending on the general category of the character as defined by the Unicode standard. A LowercaseLetter is any character whose general category is Ll. An UppercaseLetter is any character whose general category is Lu, Lt, or Lo. A Number is any character whose general category is Nd, Nl, or No.

However, this is not what the lexer actually does:

fragment
Letter
    : 'a'..'z'
    | 'A'..'Z'
    | '\u00c0'..'\ufffe'
    ;

These are not equivalent: For example, the ʰ has the Unicode category lm (letter, modifier), which the spec doesn’t mention – but as the character is U+02B0 ≤ U+FFFE, the grammar accepts it. Here I would suggest that any Unicode letter should be allowed as IdentifierPart. On the other hand, the parser also accepts the Narrow No-Break Space ‘ ’ (U+202F) as part of an identifier, which is of course wrong. (I also tried to find an example where the lexer would fail to recognize a letter from the Supplementary Multilingual Plane, but it didn’t work, probably because of surrogate pair stuff.)

Code example (copy+paste it into your IDE):

class “This is one identifier – crazy,isn’t it?”(){}
lucaswerkmeister commented 10 years ago

Wow, it seems like ANTLR doesn’t have an easy solution to that – this Java grammar (they serve the file as HTML, view source to see the line breaks) contains almost 700 lines of character ranges for IdentifierStart and IdentifierPart :D

FroMage commented 10 years ago

Does seem like a bug, rather than a spec oversight.

lucaswerkmeister commented 10 years ago

You think that the spec should allow this!?

class “This is one identifier – crazy,isn’t it?”(){}
FroMage commented 10 years ago

No, I'm saying the spec is right and it's a grammar bug ;)

lucaswerkmeister commented 10 years ago

Ah, I misread the order of “rather than” :) sorry!

gavinking commented 10 years ago

In fact this stuff is supposed to be handled by the following grammar rule:

fragment 
IdentifierPart
    :   '_'
    |   Digit
    |   Letter
        { if (!Character.isJavaIdentifierPart($text.codePointAt(0))) { 
          //TODO: error!
        } }
    ;

But oops: look at that TODO!

gavinking commented 10 years ago

Fixed, I think. The only question I have is: do all Unicode letters and digits lie within the ranges 'a'..'z', 'A'..'Z', '0'..'1', and '\u00c0'..'\ufffe'?

lucaswerkmeister commented 10 years ago

Well, there’s still the category “Letter, Modifier”. If you allowed that, checking for validity would be easier.

Apart from that… I don’t know how the surrogate characters work in ANTLR. I tried to trip it up with the LINEAR B IDEOGRAM B100 MAN ‘𐂀’, which is a “Letter, Other” with code U+10080 (second 16 bytes are not in range c0-fffe), but that’s still recognized as an identifier.

gavinking commented 10 years ago

Answering my own question: no, it looks like the correct range would be '\u0080'..'\u2FA1F'.

gavinking commented 10 years ago

OK, I believe this is fixed now.