crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

Unicode identifiers that are not normalized #11222

Open HertzDevil opened 2 years ago

HertzDevil commented 2 years ago

The following error might or might not be expected, but it is certainly confusing to look at:

```crystal
à = 1  # U+00E0
puts à # U+0061 U+0300
# Error: undefined local variable or method 'à' for top-level
```
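The two spellings render identically but are different codepoint sequences. As a rough illustration outside of Crystal, here is a small Python sketch using the standard `unicodedata` module; the `codepoints` helper is just a name made up for this example, and NFC is picked only to show that the two spellings share one normalized form.

```python
import unicodedata

precomposed = "\u00e0"   # U+00E0 LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"   # U+0061 followed by U+0300 COMBINING GRAVE ACCENT


def codepoints(text):
    """Hypothetical helper: list the codepoints of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(precomposed))    # ['U+00E0']
print(codepoints(decomposed))     # ['U+0061', 'U+0300']

# A direct codepoint comparison treats the two names as different.
print(precomposed == decomposed)  # False

# After normalizing to NFC, both spellings are the same string.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```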

This is because the compiler compares the variable names by codepoints directly and does not perform any Unicode normalization. I tested the same in other languages:

straight-shoota commented 2 years ago

Just to make sure: "Python and Rust work" means they treat different non-normalized representations of the same normalized form as identical?

straight-shoota commented 2 years ago

I think I'd prefer to merge identifiers on their normalized form. The formatter can run normalization, so actual code would contain almost exclusively normalized forms (unless the formatter is not used).
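A minimal sketch of what "merging on the normalized form" could look like, written in Python for illustration only. The `Scope` class and `normalize_identifier` function are hypothetical and do not reflect the Crystal compiler's internals; NFC is an assumption here, since the thread does not settle on a normalization form.

```python
import unicodedata


def normalize_identifier(name):
    """Hypothetical: map an identifier to its NFC form (NFC is an assumption)."""
    return unicodedata.normalize("NFC", name)


class Scope:
    """Toy symbol table keyed by normalized identifier names."""

    def __init__(self):
        self._vars = {}

    def declare(self, name, value):
        self._vars[normalize_identifier(name)] = value

    def lookup(self, name):
        key = normalize_identifier(name)
        if key not in self._vars:
            raise NameError(f"undefined local variable or method '{name}'")
        return self._vars[key]


scope = Scope()
scope.declare("\u00e0", 1)       # declared with U+00E0
print(scope.lookup("a\u0300"))   # looked up with U+0061 U+0300, prints 1
```

With this approach both spellings resolve to the same variable, and a formatter applying the same function to identifiers would leave only one spelling in formatted code.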

We could also consider disallowing non-normalized forms. That would save us from doing normalization in the formatter. But it might be too restrictive for some environments; I'm not sure.
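For comparison, rejecting non-normalized identifiers outright might look roughly like the Python sketch below. `check_identifier` is a hypothetical name, NFC is again an assumption, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def check_identifier(name):
    """Hypothetical check: reject identifiers that are not already in NFC."""
    if not unicodedata.is_normalized("NFC", name):
        suggestion = unicodedata.normalize("NFC", name)
        raise SyntaxError(
            f"identifier {name!r} is not normalized; use {suggestion!r} instead"
        )


check_identifier("\u00e0")       # already NFC, accepted silently

try:
    check_identifier("a\u0300")  # decomposed spelling, rejected
except SyntaxError as err:
    print("rejected:", err)
```

The trade-off is that the error surfaces at the offending spelling instead of being silently merged, at the cost of refusing input from editors or input methods that emit decomposed text.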