crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.5k stars 1.62k forks source link

Stricter grammar for Unicode identifier names #11216

Open HertzDevil opened 3 years ago

HertzDevil commented 3 years ago

Any chracter whose Unicode codepoint is 0xA0 or above is allowed in identifier names (see https://github.com/crystal-lang/crystal/pull/10978#discussion_r674068256). This means the following is valid code:

​= 1
puts ​ # => 1

The above, in turn, is produced by the following Crystal code:

puts "\u200B= 1"
puts "puts \u200B"

That is, the variable's name is just the zero-width space character. To exclude those edge cases I suggest that we implement the Default Identifiers from UAX #31 (revision 35) with the following profile:

<Start> := XID_Start + U+005F
<Continue> := <Start> + XID_Continue
<Medial> :=

The above sets, when restricted to ASCII, are consistent with Crystal's existing lexer rules. Type and constant names must still begin with an ASCII uppercase letter.

For reference, this is also how C++23 does it (with the additional requirement that all identifiers must be in NFC).

straight-shoota commented 2 years ago

https://github.com/crystal-lang/crystal/pull/11508#discussion_r766923604 suggests to remove the Number Letter category from <Start>. I think that's fine. We can always reconsider and loosen this restriction if there are some legitimate use cases for identifiers starting with number letters.

straight-shoota commented 2 years ago

11508 has been reverted in #11687 because it was too restrictive resulting in a breaking change that's unacceptable for a minor release.

HertzDevil commented 1 year ago

Do we really want "most" emojis as identifiers? Ones that contain ZWJs won't be allowed anyway, unless the lexer operates on grapheme clusters and the compiler / standard library supports those sequences. I wouldn't miss being able to name my variables like 😂 and 🫱🏿‍🫲🏾.

We could simply bring #11508 back but use the new parser warning facilities instead.