Closed mishra14 closed 1 year ago
@mishra14 The Roslyn compiler doesn't accept surrogate pairs. As long as the compiler uses the UTF-16 based string implementation, the cost of accepting them is too high.
There is an easy but dangerous way to accept them: add the following line to the UnicodeCharacterUtilities.IsLetterChar method:
case UnicodeCategory.Surrogate:
after this line: https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/UnicodeCharacterUtilities.cs#L125
Swift takes this approach. However, it introduces another problem: any character encoded as a surrogate pair, including symbols, can then be used in an identifier. The following link shows valid Swift source code that uses Mathematical Alphanumeric Symbols as an identifier.
https://swiftlang.ng.bluemix.net/#/repl/57c4393a7adc02c56275043d
Thus, a more correct way is to use UnicodeCategory GetUnicodeCategory(string, int)
instead of UnicodeCategory GetUnicodeCategory(char),
but its performance hit is not trivial.
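To illustrate the difference between the two overloads, here is a minimal sketch. It assumes the overloads meant are the ones on System.Globalization.CharUnicodeInfo (the comment above does not name the class), and it uses U+20000 from CJK Unified Ideographs Extension B as the sample character. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

static class CategoryDemo
{
    // Category of a single UTF-16 code unit. One half of a surrogate
    // pair carries no information about the character it belongs to.
    public static UnicodeCategory ByCodeUnit(string text, int index) =>
        CharUnicodeInfo.GetUnicodeCategory(text[index]);

    // Category of the character starting at index. A valid surrogate
    // pair is combined into one code point before the lookup.
    public static UnicodeCategory ByCodePoint(string text, int index) =>
        CharUnicodeInfo.GetUnicodeCategory(text, index);

    public static void Main()
    {
        // U+20000 lies outside the BMP, so it occupies two UTF-16 code units.
        string ideograph = char.ConvertFromUtf32(0x20000);

        Console.WriteLine(ideograph.Length);
        Console.WriteLine(ByCodeUnit(ideograph, 0));
        Console.WriteLine(ByCodePoint(ideograph, 0));
    }
}
```

The per-code-unit call can only report Surrogate for either half of the pair, while the string overload should report OtherLetter (Lo) for U+20000, which is the class the identifier rules care about.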
Treating surrogates properly in identifiers is nontrivial, but it doesn't have to be much of a performance hit. Only when surrogates are actually used would any surrogate-related code path be taken.
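A rough sketch of that idea follows. It is not Roslyn's actual code: IdentifierScanSketch, IsLetterCategory, and IsLetterAt are hypothetical names, and the letter classes (Lu, Ll, Lt, Lm, Lo, Nl) are the ones the C# specification lists for letter-character. BMP characters go through the same per-char lookup as before, and the string-based lookup runs only when a surrogate is seen.

```csharp
using System;
using System.Globalization;

static class IdentifierScanSketch
{
    // Hypothetical helper: Unicode classes Lu, Ll, Lt, Lm, Lo, Nl.
    static bool IsLetterCategory(UnicodeCategory category)
    {
        switch (category)
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.LetterNumber:
                return true;
            default:
                return false;
        }
    }

    // Returns true if a letter starts at index. width is the number of
    // UTF-16 code units consumed: 2 for a valid surrogate pair, else 1.
    public static bool IsLetterAt(string text, int index, out int width)
    {
        char c = text[index];
        width = 1;

        if (!char.IsSurrogate(c))
        {
            // Fast path: BMP characters are classified per char, as before.
            return IsLetterCategory(CharUnicodeInfo.GetUnicodeCategory(c));
        }

        if (char.IsSurrogatePair(text, index))
        {
            // Slow path: classify the combined code point.
            width = 2;
            return IsLetterCategory(CharUnicodeInfo.GetUnicodeCategory(text, index));
        }

        // Unpaired surrogate: never part of an identifier.
        return false;
    }

    public static void Main()
    {
        string source = char.ConvertFromUtf32(0x20000) + "x";
        int width;
        bool isLetter = IsLetterAt(source, 0, out width);
        Console.WriteLine(isLetter + " " + width);
    }
}
```

Because the extra branch is taken only for surrogate code units, source files that stay within the BMP pay for a single additional IsSurrogate check per character. Unlike accepting UnicodeCategory.Surrogate wholesale, this still rejects supplementary characters that are symbols.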
@mishra14 I am all for supporting surrogate pairs and would have pointed out the specification myself, except that it explicitly refers to Unicode 3.0, where there were no such surrogates as far as I know...
@miloush More recent versions of the language specifications will refer to more recent versions of the Unicode specs. Sorry the C# 6 spec isn't out yet.
Actually, the older versions of the spec appear to say we should support surrogates. For example, see https://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx which implies that the following program should compile without error. Roslyn rejects it.
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main(string[] args)
    {
        int \U00020000 = 10; // http://www.fileformat.info/info/unicode/char/20000/index.htm
        Console.WriteLine(\U00020000);
    }
}
This is also reported as #9731
duping to #9731 which is the primary issue for this.
I am working on GB18030 certification prep for NuGet Visual Studio UI.
It seems that characters from CJK Unified Ideographs Extension B are not accepted in a C# namespace name in Visual Studio.
But according to the specs here, they should be, since CJK Extension B falls into the Lo class of Unicode.
Repro -
More ref - SO Question
//cc : @rrelyea