tannergooding opened this issue 8 years ago
FYI. @jaredpar, @dotnet/roslyn-compiler
Also, FYI @brettfo who helped discover the issue
Is this a regression?
@AlekseyTs, not a regression. It repros back to the v2.0 native compiler. I don't expect this to be a priority (especially since it hasn't been reported or discovered outside a random lunch discussion).
I did run into this issue in the past (with different characters). The reason I did not push it further is that the C# specification (version 5.0, anyway) actually says
For information on the Unicode character classes mentioned above, see The Unicode Standard, Version 3.0, section 4.5.
That is the only Unicode standard mentioned in references. Your character was not added to Unicode until 5.0. In fact, there were no characters above 0xFFFF in version 3.0.
On the other hand,
Since C# uses a 16-bit encoding of Unicode code points in characters and string values, a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal.
does not say anything about identifiers. And while it might be implied that 𒅴 itself is not explicitly allowed, the way I read the specification, `int \U00012174;` should be allowed, which it isn't.
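For reference, a minimal repro sketch of the two spellings in question (the class and method names are mine and purely illustrative; CS1056 is the error discussed later in this issue):

```csharp
class IdentifierRepro
{
    // Per the reading of the spec above, both spellings below name an identifier whose single
    // letter-character is U+12174 (category Lo), yet the compiler currently rejects both
    // with CS1056 ("Unexpected character").

    static void ViaEscape()
    {
        int \U00012174 = 0;   // Unicode escape sequence form
    }

    static void ViaLiteral()
    {
        int 𒅴 = 0;            // literal form; stored as a surrogate pair in the source text
    }
}
```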
By the way, some characters have been reclassified into different categories over time, and it is not clear to me whether the specification requires the classifications as they stood at version 3.0.
@miloush https://github.com/dotnet/roslyn/blob/master/docs/compilers/CSharp/Unicode%20Version.md
The Roslyn compilers depend on the underlying platform for their Unicode behavior. As a practical matter, that means that the new compiler will reflect changes in the Unicode standard.
@ufcpp it would be nonsense to do anything else; I am just not sure that is what the language specification suggests. It might be worth updating it.
This is also reported as #13474
See also #13474 and #13560
The language specification for C# (both the online version and the ECMA version) permits characters larger than 16 bits in identifiers, written either as a Unicode escape sequence or as a surrogate pair of UTF-16 code units. It is a bug that the compiler does not accept them.
In order to fix this for VB too, we'd need an API to lowercase a UTF32 code point. See https://github.com/dotnet/corefx/issues/30879
The needed API has been added: see https://github.com/dotnet/corefx/issues/30879#issuecomment-445990453 and https://apisof.net/catalog/System.Text.Rune.ToLowerInvariant(Rune)
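For reference, a minimal sketch of using that `Rune` API to lowercase a supplementary-plane code point (the Deseret example character is my choice, picked because its lowercase mapping also lies outside the BMP):

```csharp
using System;
using System.Text;

class RuneLoweringExample
{
    static void Main()
    {
        // U+10400 (DESERET CAPITAL LETTER LONG I) is outside the BMP, so it cannot be
        // lowercased one UTF-16 code unit at a time; Rune operates on whole code points.
        Rune upper = new Rune(0x10400);
        Rune lower = Rune.ToLowerInvariant(upper);

        Console.WriteLine($"U+{upper.Value:X5} -> U+{lower.Value:X5}"); // U+10400 -> U+10428
    }
}
```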
.NET Core only, so we can't use that API for a while.
@jaredpar I expect we'd use the code that Stephen suggested in the shared library in C#, until that someday when those APIs are available to us.
The C# specification states that an *identifier* can start with or contain anything matching *letter-character*, which is defined as a Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl, or a *unicode-escape-sequence* representing a character of one of those classes.

However, the compiler does not appear to correctly interpret some characters which match the above categories if they are part of a surrogate pair.
For example, the Sumerian character 𒅴 is categorized as OtherLetter (matching Lo above) when processed through `char.GetUnicodeCategory("𒅴", 0)`.

However, the compiler interprets this character as two separate characters (and reports CS1056 for both). It is likely checking each character individually, rather than checking whether the first character is part of a surrogate pair and, if it is, interpreting the pair as a single character.
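To illustrate the misclassification, a small runnable check (the program below is my own sketch; the string contains the single code point U+12174 encoded as a surrogate pair, and the category results mirror what is described above):

```csharp
using System;

class SurrogateCategoryExample
{
    static void Main()
    {
        // The cuneiform character from the report, built from its code point;
        // it occupies two UTF-16 code units in the string.
        string eme = char.ConvertFromUtf32(0x12174);

        // Looking at the whole surrogate pair yields the category the spec grammar cares about.
        Console.WriteLine(char.GetUnicodeCategory(eme, 0));  // OtherLetter (Lo)

        // Looking at each UTF-16 code unit individually, as the lexer appears to do,
        // yields Surrogate for both halves, which is not a letter-character category.
        Console.WriteLine(char.GetUnicodeCategory(eme[0]));  // Surrogate
        Console.WriteLine(char.GetUnicodeCategory(eme[1]));  // Surrogate
    }
}
```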