dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

Surrogate pairs not recognized in identifiers. #13474

Closed mishra14 closed 1 year ago

mishra14 commented 8 years ago

I am working on GB18030 certification prep for the NuGet Visual Studio UI.

It seems that characters from CJK Unified Ideographs Extension B are not accepted in C# namespace or class names in Visual Studio.

But according to the specs here, they should be, since CJK Extension B falls into the Lo (Other Letter) category of Unicode.

Repro:

  1. Create a new console app in VS.
  2. Change the namespace or class name to a combination of CJK Extension B characters, e.g. 𠀀𠀁𠀂𠀃 (the first four).
  3. Observe that VS reports an error.

More ref - SO Question

//cc : @rrelyea

ufcpp commented 8 years ago

@mishra14 The Roslyn compiler doesn't accept surrogate pairs. As long as the compiler uses string, a UTF-16 based implementation, the cost of accepting them is too high.

There is an easy but dangerous way to accept them: add the following line to the UnicodeCharacterUtilities.IsLetterChar method.

                case UnicodeCategory.Surrogate:

after this line: https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/UnicodeCharacterUtilities.cs#L125

Swift takes this approach. However, it introduces another problem: any character encoded as a surrogate pair, including symbols, becomes valid in an identifier. The following link shows valid Swift source code that uses Mathematical Alphanumeric Symbols as identifiers.

https://swiftlang.ng.bluemix.net/#/repl/57c4393a7adc02c56275043d

Thus, the more correct way is to use UnicodeCategory GetUnicodeCategory(string, int) instead of UnicodeCategory GetUnicodeCategory(char), but its performance hit is not trivial.
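
As a rough illustration of the difference between the two overloads (a minimal sketch, not Roslyn code; the class and method names are made up for this example), the char overload sees only one UTF-16 code unit, while the (string, int) overload decodes the surrogate pair before looking up the category:

```csharp
using System;
using System.Globalization;

static class SurrogateCategoryDemo
{
    // U+20000, the first CJK Unified Ideographs Extension B character.
    // In a .NET string it is stored as a surrogate pair (two chars).
    const string ExtB = "\U00020000";

    // Category of a single UTF-16 code unit (hypothetical helper).
    public static UnicodeCategory PerChar(string s, int index)
    {
        return char.GetUnicodeCategory(s[index]);
    }

    // Category of the full code point that starts at index (hypothetical helper).
    public static UnicodeCategory PerCodePoint(string s, int index)
    {
        return char.GetUnicodeCategory(s, index);
    }

    static void Main()
    {
        Console.WriteLine(ExtB.Length);           // 2
        Console.WriteLine(PerChar(ExtB, 0));      // Surrogate
        Console.WriteLine(PerCodePoint(ExtB, 0)); // OtherLetter
    }
}
```

With the per-char lookup, adding Surrogate to the accepted categories lets every supplementary character through, whatever its real category is. The per-code-point lookup reports the real category and so can tell a letter from a symbol.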

gafter commented 8 years ago

Treating surrogates properly in identifiers is nontrivial, but it doesn't have to be much of a performance hit. Only when surrogates are actually used would any surrogate-related code path be taken.
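
The fast-path idea can be sketched as follows. This is only an illustration under assumptions: TryScanIdentifierStart and IsLetterCategory are hypothetical names, not Roslyn APIs, and the sketch covers only the identifier-start rule (letter categories plus underscore).

```csharp
using System;
using System.Globalization;

static class IdentifierScanSketch
{
    // Hypothetical helper: reports whether an identifier-start character begins
    // at text[index], and how many UTF-16 code units (1 or 2) it occupies.
    public static bool TryScanIdentifierStart(string text, int index, out int width)
    {
        char c = text[index];

        if (!char.IsSurrogate(c))
        {
            // Fast path: ordinary source text pays only for the surrogate check.
            width = 1;
            return c == '_' || IsLetterCategory(char.GetUnicodeCategory(c));
        }

        // Slow path: reached only when the source actually contains surrogates.
        if (char.IsHighSurrogate(c)
            && index + 1 < text.Length
            && char.IsLowSurrogate(text[index + 1]))
        {
            width = 2;
            return IsLetterCategory(CharUnicodeInfo.GetUnicodeCategory(text, index));
        }

        // Unpaired surrogate: never part of an identifier.
        width = 1;
        return false;
    }

    // Letter categories allowed at the start of a C# identifier: Lu, Ll, Lt, Lm, Lo, Nl.
    static bool IsLetterCategory(UnicodeCategory category)
    {
        switch (category)
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.LetterNumber:
                return true;
            default:
                return false;
        }
    }

    static void Main()
    {
        int width;
        bool ok = TryScanIdentifierStart("\U00020000 = 10;", 0, out width);
        Console.WriteLine(ok + " " + width); // True 2
    }
}
```

Because the scanner returns a width, the caller can advance by two code units after a surrogate pair instead of assuming one char per character.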

miloush commented 8 years ago

@mishra14 I am all for supporting surrogate pairs and would have pointed out the specification myself, except that it explicitly refers to Unicode 3.0, where there were no such surrogates as far as I know...

gafter commented 8 years ago

@miloush More recent versions of the language specifications will refer to more recent versions of the Unicode specs. Sorry the C# 6 spec isn't out yet.

gafter commented 8 years ago

Actually, the older versions of the spec appear to say we should support surrogates. For example, see https://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx which implies that the following program should compile without error. Roslyn rejects it.

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main(string[] args)
    {
        int \U00020000 = 10; // http://www.fileformat.info/info/unicode/char/20000/index.htm
        Console.WriteLine(\U00020000);
    }
}

gafter commented 8 years ago

This is also reported as #9731

jaredpar commented 1 year ago

Duping to #9731, which is the primary issue for this.