Compiler does not correctly interpret surrogate pairs when used in an identifier

dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.

https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/

MIT License

19.04k stars 4.03k forks source link

Compiler does not correctly interpret surrogate pairs when used in an identifier #9731

Open tannergooding opened 8 years ago

tannergooding commented 8 years ago

The C# specification states that an identifier can start with or contain anything matching letter-character, which is defined as:

letter-character::
A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
A unicode-escape-sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl

However, the compiler does not appear to correctly interpret some characters which match the above categories if they are part of a surrogate pair.

For example the sumerian character 𒅴 is categorized as 'OtherLetter' (matching 'Lo' above) when processed through char.GetUnicodeCategory("𒅴", 0).

However, the compiler is interpreting this character as two separate characters (and reporting CS1056 for both). It is likely checking each character individually, rather than checking if the first character is part of a surrogate pair and interpreting the character appropriately if it is.

tannergooding commented 8 years ago

FYI. @jaredpar, @dotnet/roslyn-compiler

Also, FYI @brettfo who helped discover the issue

AlekseyTs commented 8 years ago

Is this a regression?

tannergooding commented 8 years ago

@AlekseyTs, not a regression. It repro's back to the v2.0 native compiler. I don't expect this is a priority (especially since it hasn't been reported or discovered outside a random lunch discussion).

miloush commented 8 years ago

I did run into this issue in the past (with different characters). The reason I did not push it through was because the C# specification (version 5.0 anyway) actually says

For information on the Unicode character classes mentioned above, see The Unicode Standard, Version 3.0, section 4.5.

That is the only Unicode standard mentioned in references. Your character was not added to Unicode until 5.0. In fact, there were no characters above 0xFFFF in version 3.0.

On the other hand,

Since C# uses a 16-bit encoding of Unicode code points in characters and string values, a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal.

does not say anything about identifiers. And while it might be implied that 𒅴 is perhaps not allowed explicitly, the way I read the specification suggests that int \U00012174; should be allowed which isn't.

By the way, Some characters have been reclassified to different categories over time, it is not clear to me whether the specification suggests whether the state at version 3.0 must be used.

ufcpp commented 8 years ago

@miloush https://github.com/dotnet/roslyn/blob/master/docs/compilers/CSharp/Unicode%20Version.md

The Roslyn compilers depend on the underlying platform for their Unicode behavior. As a practical matter, that means that the new compiler will reflect changes in the Unicode standard.

miloush commented 8 years ago

@ufcpp it would be nonsense to do anything else, I am just not sure whether that's what the language specification suggests; might be worth updating it.

gafter commented 8 years ago

This is also reported as #13474

gafter commented 6 years ago

See also #13474 and #13560

gafter commented 6 years ago

The language specification for C# (both the online version and the ECMA version) permit characters larger than 16 bits in identifiers, either using a Unicode escape sequence or a surrogate pair of code points. It is a bug that the compiler does not accept them.

gafter commented 6 years ago

In order to fix this for VB too, we'd need an API to lowercase a UTF32 code point. See https://github.com/dotnet/corefx/issues/30879

gafter commented 5 years ago

The needed API has been added: see https://github.com/dotnet/corefx/issues/30879#issuecomment-445990453 and https://apisof.net/catalog/System.Text.Rune.ToLowerInvariant(Rune)

jaredpar commented 5 years ago

.NET Core only so can't that API for a while.

gafter commented 5 years ago

@jaredpar I expect we'd use the code that Stephen suggested in the shared library in C# until that someday, if those APIs are available.