dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.25k stars 4.73k forks source link

CultureInfo.TextInfo.ListSeparator broken in .Net 5 #43795

Closed olmobrutall closed 3 years ago

olmobrutall commented 4 years ago

As part of updating of updating the Csv class in Signum.Utilities to .Net 5, I've realize that the value for ListSeparator has changed for many cultures that use , instead of . for decimals separators.

foreach (var ci in new [] { "", "en", "en-GB", "en-US", "es", "es-ES", "de", "de-DE", "fr", "fr-FR" })
{
    Console.WriteLine($"Culture {ci}\tListSeparator {CultureInfo.GetCultureInfo(ci).TextInfo.ListSeparator}");
}

netcoreapp3.1

Culture         ListSeparator ,
Culture en      ListSeparator ,
Culture en-GB   ListSeparator ,
Culture en-US   ListSeparator ,
Culture es      ListSeparator ;
Culture es-ES   ListSeparator ;
Culture de      ListSeparator ;
Culture de-DE   ListSeparator ;
Culture fr      ListSeparator ;
Culture fr-FR   ListSeparator ;

net5 rc2

Culture         ListSeparator ,
Culture en      ListSeparator ,
Culture en-GB   ListSeparator ,
Culture en-US   ListSeparator ,
Culture es      ListSeparator .
Culture es-ES   ListSeparator .
Culture de      ListSeparator .
Culture de-DE   ListSeparator .
Culture fr      ListSeparator  
Culture fr-FR   ListSeparator  

Both tests in the same machine.

Previously I was using ListSeparator as a way to detect the separator that will be used in a Csv file. Is this the intended purpose?

rubenprins commented 3 years ago

@tarekgh what's wrong with Number symbols list element (https://unicode.org/reports/tr35/tr35-numbers.html#Number_Symbols)? Because for the few European cultures I checked that have ';' as a list separator on Windows, this property is also ';' in CLDR.

Also, the list pattern you mention is a generic list separator for itemizing AFAICT, whereas sList/ListSeparator is used to group numbers (where ',' is not possible for European cultures, due to that character already being the decimal separator -- and I can attest to this European usage of ”;” in case of ambiguities in case of writing number sequences and intervals).

The fact that all other characters in the Numbers section in the regional settings also correspond to the symbols section from CLDR also points at that particular solution.

Yes, ListSeparatorn is wonderfully vague on what kind of lists it is talking about, but judging from the context of the regional settings and its use by lots of software this way (Excel, CSV), I think the list symbol makes much more sense, and is probably mostly backward compatible with NLS data.

KalleOlaviNiemitalo commented 3 years ago

CLDR 38 has a semicolon ; without spaces as the number list separator in en.xml, and en_US.xml does not override it. So it's not the same as the Windows NLS data, which has used a comma , for en-US.

Anyway, I now think the CLDR number list separator is a better fit for TextInfo.ListSeparator than what one could extract from list patterns, because it does not end with a space. An API for linguistic list formatting using CLDR list patterns could then be added in .NET 6 or 7 as a separate issue.

(The country-dependent list separator was apparently added to INT 21H AH=38H in MS-DOS 3.0, and only had room for one byte of text and a null character.)

tarekgh commented 3 years ago

@rubenprins @KalleOlaviNiemitalo TextInfo.ListSeparator is not for numbers only. it can separate any kind of data. This is why list pattern would fit better. Also if you look at https://unicode.org/reports/tr35/tr35-numbers.html#Number_Symbols it is clearly says

- symbol used to separate numbers in a list intended to represent structured data such as an array; must be different from the decimal value. This list separator is for “non-linguistic” usage as opposed to the listPatterns for “linguistic” lists (e.g. “Bob, Carol, and Ted”) described in Part 2, Section 11 List Patterns.

Which means this number separator is not for linguistic while TextInfo.ListSeprator is a cultural property which should be linguistic. That is exactly match the list pattern in CLDR.

As @KalleOlaviNiemitalo indicated, even CLDR number separator will not be compatible with NLS. If really non-linguistic separator needed here, then I would say let it be , for all locales which I think will make CSV/Excel scenarios less broken.

To reiterate on CSV/Excel, I am still not seeing a way can guarantee ListSeparator will make this scenario reliable. This is wrong usage of ListSeparator and apps should depend on the real Excel doc separator. We'll keep ListSeparator read the user override for the current culture so most of the scenarios (as @KalleOlaviNiemitalo thankfully pointed at) like PoswerShell and COM interop will continue work as it used to be.

If people prefer the NLS compatibility, we can keep this but this is the least preferable to me because this is Windows specific and also this data can change at any time too in Windows. I am still thinking the data I mentioned https://github.com/dotnet/runtime/issues/43795#issuecomment-722049526 would be the most sensible data we can return for this separator. what you think?

tarekgh commented 3 years ago

To move forward, I would like to get your feedback which of the following options would be preferable for your scenarios:

As I mentioned before I think the first option make more sense but will still be different than NLS and will not be fully compatible. The second option would be preferrable only I guess if cares more about compatibility behavior and to be less breaking (as it is used with CSV files).

tarekgh commented 3 years ago

If I didn't hear from you, I'll go with the NLS data to keep the compatibility and in the future releases we can expose new APIs if needed for list patterns.

olmobrutall commented 3 years ago

Hi @tarekgh. Not sure if you referred to me but NLS looks better to me too.

It is possible to deliver this behavior in Unix? Or the behavior will be different in Unix and Windows?

olmobrutall commented 3 years ago

BTW, I just tested in exporting a CSV with Excel in my wife MacBook Pro configured with German and Spanish cultures, the results are consistent with Excel in Windows:

Hi;There;1.500;1,5

What about asking Office team what are they using to get the ; as list separator on Unix environments?

tarekgh commented 3 years ago

It is possible to deliver this behavior in Unix? Or the behavior will be different in Unix and Windows?

The idea is to deliver the behavior in both Windows and Unix. we'll hardcode the data and use it for all platforms.

Thanks for trying your scenario on Mac. Usually Office carry some data for that but I'll double check anyway.

ANahr commented 3 years ago

To move forward, I would like to get your feedback which of the following options would be preferable for your scenarios:

  • Use the CLDR list pattern data as I listed in the top of #43795 (comment).
  • Use NLS data listed at the bottom of same comment #43795 (comment).
  • Just return , for all cultures.

As I mentioned before I think the first option make more sense but will still be different than NLS and will not be fully compatible. The second option would be preferrable only I guess if cares more about compatibility behavior and to be less breaking (as it is used with CSV files).

IMHO the most unproblematic choice would be to keep NLS hard-coded

Second one would be use CLDR list SYMBOL (e.g see https://github.com/unicode-org/cldr/blob/d1c59aeea6a26cd1018e164e8c470e4873925cac/common/main/en.xml#L4307). CLDR list pattern does not make any sense to me because we look for a symbol, not for a pattern which can be arbitrarily complex.

Third one would be hard code something that would universally work (should not be something that EVER collides with decimal, so this cannot be ',') maybe ';', but I didn't check if that collides somewhere.

tarekgh commented 3 years ago

IMHO the most unproblematic choice would be to keep NLS hard-coded

That is what we decided to go with. I think this would address all currently broken scenarios.

tarekgh commented 3 years ago

This now fixed by the PR https://github.com/dotnet/runtime/pull/44823