dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Right to Left Cultures for numbers #60590

Open aburakab opened 2 years ago

aburakab commented 2 years ago

I'm not sure whether this is a bug. Starting with .NET 5, right-to-left cultures handle "-1" in a strange way. Could you please explain?

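// Script-style snippet; Dump() is assumed to be LINQPad's extension method.
// Requires: using System; using System.Globalization; using System.Threading;
var str = "-1";
var num = Convert.ToInt32(str, CultureInfo.InvariantCulture);

// Left-to-right culture: the formatted number "-1" is 2 characters long.
var culture = new CultureInfo("en-US");
Thread.CurrentThread.CurrentCulture = culture;
str.Dump(CultureInfo.CurrentCulture.Name);
num.Dump(CultureInfo.CurrentCulture.Name);
num.ToString().Length.Dump($"{CultureInfo.CurrentCulture.Name} => ToString().Length");

// Right-to-left culture: the output differs between .NET Core 3.1 and .NET 5/6.
culture = new CultureInfo("ar-SA");
Thread.CurrentThread.CurrentCulture = culture;
str.Dump(CultureInfo.CurrentCulture.Name);
num.Dump(CultureInfo.CurrentCulture.Name);
num.ToString().Length.Dump($"{CultureInfo.CurrentCulture.Name} => ToString().Length");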

[Image: output on .NET 5 and 6]

[Image: output on .NET Core 3.1]

KalleOlaviNiemitalo commented 2 years ago

PowerShell 7.1.4 using .NET 5.0.9:

PS C:\> (-123).ToString([System.Globalization.CultureInfo]"ar-SA")
؜-123
PS C:\> (-123).ToString([System.Globalization.CultureInfo]"ar-SA").EnumerateRunes() | % { "U+{0:X4}" -f $_.Value }
U+061C
U+002D
U+0031
U+0032
U+0033
PS C:\> ([System.Globalization.CultureInfo]"ar-SA").NumberFormat.NegativeSign.EnumerateRunes() | % { "U+{0:X4}" -f $_.Value }
U+061C
U+002D

The minus sign precedes the digit 1 in the string, but there is also U+061C ARABIC LETTER MARK, which I suppose affects how the Unicode bidirectional algorithm lays out the text.
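
For readers following along in C# rather than PowerShell, the same inspection can be sketched as below. The class and method names (`CodePoints`, `Describe`) are made up for illustration; `EnumerateRunes` requires .NET Core 3.0 or later.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodePoints
{
    // Formats each Unicode scalar value of the string as U+XXXX, separated by spaces.
    // Invisible characters such as U+061C become visible this way.
    public static string Describe(string s) =>
        string.Join(" ", s.EnumerateRunes().Select(r => $"U+{r.Value:X4}"));
}
```

Calling `CodePoints.Describe((-123).ToString(new CultureInfo("ar-SA")))` on an ICU-based runtime should list the same code points as the PowerShell output above. The extra U+061C would also explain a `ToString().Length` that is one larger than the visible text suggests.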

aburakab commented 2 years ago

OK, I see. Here is the description of U+061C ARABIC LETTER MARK:

"The Arabic letter mark (ALM) is a non-printing character used in the computerized typesetting of bi-directional text containing mixed left-to-right scripts (such as Latin and Cyrillic) and right-to-left scripts (such as Arabic, Syriac and Hebrew). Similar to Right-to-left mark (RLM), it is used to change the way adjacent characters are grouped with respect to text direction, with some difference on how it affects the bidirectional level resolutions for nearby characters." Reference.

This already affects one library that we use. I have sent a PR to that library, since the problem affects our product, and I solved it with this line of code:

CultureInfo.CurrentCulture.NumberFormat = NumberFormatInfo.InvariantInfo;

In my experience building web apps with Arabic cultures, mixed-direction text was a real headache, so maybe this new behavior (which did not exist in .NET Core 3.1 and earlier) is meant to solve that. However, older code that serializes or stringifies objects might be affected now.
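
For the serialization case, one alternative to replacing the current culture's `NumberFormat` is to pass `CultureInfo.InvariantCulture` explicitly wherever numbers are written to or read from machine-readable text. The following is only a minimal sketch of that idea; the `InvariantNumbers` helper is hypothetical and not part of the library mentioned above.

```csharp
using System.Globalization;

public static class InvariantNumbers
{
    // Formats without consulting the current culture, so a culture-specific
    // negative sign (such as U+061C U+002D) cannot appear in the output.
    public static string Format(int value) =>
        value.ToString(CultureInfo.InvariantCulture);

    // Parses text produced by Format, again independent of the current culture.
    public static int Parse(string text) =>
        int.Parse(text, NumberStyles.Integer, CultureInfo.InvariantCulture);
}
```

This leaves the current culture untouched, so numbers shown to users keep their culture-specific formatting while serialized values stay stable.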

KalleOlaviNiemitalo commented 2 years ago

From https://st.unicode.org/cldr-apps/v#/ar_SA/Symbols/, it looks like the "Symbols" section has minusSign = U+002D, but the "Symbols using Arabic-Indic Digits (Arab)" section has minusSign = U+061C U+002D. Perhaps then, it is a bug that .NET uses the latter even with the 0123456789 digits.

aburakab commented 2 years ago

Interesting! Thank you @KalleOlaviNiemitalo, let us see what they are going to reply :)

KalleOlaviNiemitalo commented 2 years ago

On which operating system do you run your code? If it is on Windows, you could use the compatibility setting to switch from ICU to NLS and work around the problem that way: .NET globalization and ICU

aburakab commented 2 years ago

I'm using Windows. I'll give it a try.

tarekgh commented 2 years ago

> From st.unicode.org/cldr-apps/v#/ar_SA/Symbols, it looks like the "Symbols" section has minusSign = U+002D, but the "Symbols using Arabic-Indic Digits (Arab)" section has minusSign = U+061C U+002D. Then, it is a bug that .NET uses the latter even with the 0123456789 digits.

The shape of the digits is decided by the rendering engine, not at the level of string formatting. For example, a string can contain Latin digits and still be rendered with Hindi digits. Looking at the CLDR data, I see that even the Latin symbols insert the left-to-right mark \u200E, so either way some mark ends up in the sign. Maybe .NET should consider removing such marks from the sign when reading it, and continue to work as it did before using ICU? If we do that, callers will need to ensure that formatted numbers (or anything else) are laid out correctly. By the way, such marks are inserted in dates too, where they really help lay out the dates correctly, especially when the month names are written in Arabic.

Thoughts?

> switch from ICU to NLS and work around the problem that way

I wouldn't recommend that. Switching back to NLS is not desirable: you would lose other features, and ICU is the way forward. I suggest working around this either by overwriting the negative sign in the culture or by post-processing the formatted string to remove the inserted marks.
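
The two workarounds suggested above could look roughly like this. This is a sketch under assumptions, not an official recommendation: the `BidiMarks` class and its method names are invented for illustration, and the set of characters to strip (U+061C and U+200E from this thread, plus U+200F RIGHT-TO-LEFT MARK) is a guess at what a caller might want removed.

```csharp
using System.Globalization;
using System.Text;

public static class BidiMarks
{
    // U+061C and U+200E are the marks named in this thread;
    // U+200F RIGHT-TO-LEFT MARK is added here as an assumption.
    private const string Marks = "\u061C\u200E\u200F";

    // Returns the text with the marks above removed.
    public static string Strip(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (Marks.IndexOf(c) < 0) sb.Append(c);
        }
        return sb.ToString();
    }

    // Workaround 1: a writable copy of the culture whose negative sign has no marks.
    // The culture passed in is not modified.
    public static CultureInfo WithPlainNegativeSign(CultureInfo culture)
    {
        var copy = (CultureInfo)culture.Clone();
        copy.NumberFormat.NegativeSign = Strip(copy.NumberFormat.NegativeSign);
        return copy;
    }

    // Workaround 2: format with the culture as-is, then strip the marks afterwards.
    public static string FormatWithoutMarks(int value, CultureInfo culture) =>
        Strip(value.ToString(culture));
}
```

For example, `(-1).ToString(BidiMarks.WithPlainNegativeSign(new CultureInfo("ar-SA")))` should no longer contain U+061C. As noted earlier in the thread, once the marks are removed the caller becomes responsible for laying the number out correctly in right-to-left text.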