dotnet / dotnet-api-docs

.NET API reference documentation (.NET 5+, .NET Core, .NET Framework)
https://docs.microsoft.com/dotnet/api/
Other
684 stars 1.52k forks source link

CultureInfo constructor / CreateSpecificCulture docs are misleading or incorrect #3110

Open mikernet opened 4 years ago

mikernet commented 4 years ago

Issue description

CultureInfo constructor and CultureInfo.CreateSpecificCulture() documentation is highly misleading and confusing.

  1. On both pages it states that the name parameter must be an existing culture, but it doesn't - it can be any well formed culture name, whether it exists or not.

  2. The exceptions section for CultureInfo.CreateSpecificCulture() states that it throws CultureNotFoundException when:

The culture specified by name does not have a specific culture associated with it.

I think it would be useful to note that in the docs this method returns a blank named CultureInfo object if the culture specified does not exist because that's rather counter-intuitive. Apparently it seems like sometimes the invariant culture behaves like a neutral culture and other times it doesn't (see https://github.com/dotnet/corefx/issues/40698#issuecomment-526785405)

  1. It says this in the "note to callers" section of both pages:

The .NET Framework 3.5 and earlier versions throw an ArgumentException if name is not a valid culture name. Starting with the .NET Framework 4, this method throws a CultureNotFoundException.

Why the heck was this changed? This seems like a terrible design decision. A badly formatted parameter should be throwing an argument exception, not an exception that indicates that it couldn't be found. The docs for CultureNotFoundException states:

The exception that is thrown when a method attempts to construct a culture that is not available.

That's clearly not what is happening here. These methods will happily work fine with cultures that are not available as long as the name is well formed.

Regardless, seems that the docs should clarify this somewhat because the current behavior is highly unexpected.

  1. It says this in the "note to callers" section of the CultureInfo constructor:

Starting with apps that run under the .NET Framework 4 or later on Windows 7 or later, the method attempts to retrieve a CultureInfo object whose identifier is culture from the operating system; if the operating system does not support that culture, the method throws a CultureNotFoundException exception.

As stated above, it does not do this...it returns a CultureInfo object with the specified name which falls back to the invariant culture regardless of whether it exists in the OS or not .

Target framework

rpetrusha commented 4 years ago

Thanks for opening this issue, @mikernet.

@tarekgh, it seems that on Linux, even a CultureInfo whose name is "aaa" doesn't throw a CultureNotFoundException. On all platforms, does the culture returned depend on the underlying OS (on Windows) and ICU library (on Unix-based systems)?

mikernet commented 4 years ago

Based on our other discussion threads and my testing, my understanding is that the exception is only thrown for invalid/badly formed culture names. Any culture that is formatted properly will not throw the exception even if it doesn't exist.

tarekgh commented 4 years ago

does the culture returned depend on the underlying OS (on Windows) and ICU library (on Unix-based systems)?

Yes, we just use ICU to create the culture on Linux. We just support whatever behavior Windows and ICU support.

mikernet commented 4 years ago

Looking at the source code, it doesn't just pass directly to OS or ICU though...it clearly does some massaging here to give consistent behavior:

https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/Globalization/CultureData.cs#L291

It also looks like region to culture mapping is hard-coded and not just reliant on ICU/windows:

https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/Globalization/CultureData.cs#L153

mikernet commented 4 years ago

Invariant behavior also looks hard coded for consistency:

https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/Globalization/CultureData.cs#L420

There are a lot of spots in CultureData files that appear to be trying to minimize the differences in behavior between Unix and Windows.

tarekgh commented 4 years ago

it clearly does some massaging here to give consistent behavior:

You are pointing to other place which doesn't create a culture. the code you are pointing it is just a code mapping a culture to a region.

It also looks like region to culture mapping is hard-coded and not just reliant on ICU/windows:

Same answer.

Invariant behavior also looks hard coded for consistency:

That is right except for collation behavior.

lsmith999999 commented 3 years ago

Not sure if this is the most appropriate place to post, but close enough.

I concur with (user) mikernet . Also see here for instance:

https://social.msdn.microsoft.com/Forums/vstudio/en-US/70f01bf8-290b-43e3-a2ae-08870220c3c8/cultureinfogetcultureinfo-behavior-change-on-windows-10?forum=netfxbcl

This is a serious breaking change, since "System.Globalization.CultureInfo" now identifies languages as valid in Windows 10 that previously resulted in an exception in Windows 7 and earlier. I only just discovered the problem in fact and Windows 10 has been out for years now!! It was a serious accident waiting to happen and just did (for one of my customers).

How can an unknown (non-existent) language code be accepted simply because it's considered well-formed (and even that's confusing, because a language code normally originates from ISO 639-1 which consists of two lowercase characters only AFAIK, and even the "CultureInfo::Name" docs currently indicates that - a few very quick and informal tests for instance shows that the "CultureInfo" constructor accepts any language code string <= 3 characters, like "aaa" or "a", while rejecting anything with 4 or more characters, assuming you're passing just the language code only and not the country/regioncode code as well).

In any case, passing something like "aaa" to the "CultureInfo" constructor would cause an exception in Windows 7 and earlier but now succeeds in Windows 10. That's a serious breaking change for anyone relying on the exception to determine an invalid language code (and who wouldn't be). Whether "CultureNotFoundException" or just plain "ArgumentException" (which user mikernet addresses), why is an unknown (invalid) language code even being accepted at all (and as mentioned in the link I cited at the outset, the resulting "CultureInfo::DisplayName" object even shows "Unknown Language (aaa)"). That's absurd behavior.

So how do you consistently check for an invalid (unknown) language? Sometimes an exception results, but for something like "aaa", it doesn't (in Windows 10 only AFAIK). If no exception should I therefore check "CultureInfo::LCID" for LOCALE_CUSTOM_UNSPECIFIED (4096), but if so, what about user-defined languages? Can they also return LOCALE_CUSTOM_UNSPECIFIED (I need to investigate). Note that "CultureTypes::UserCustomCulture" is officially deprecated so presumably I can't (reliably) rely on "CultureInfo::CultureTypes" to check for it (and it's unclear whether it would be reliable anyway when dealing with unknown languages).

The simple exception that used to easily identify non-existent languages is no longer reliable and the situation is now very confusing (in addition to how many people's code must have been quietly broken by this change).

tarekgh commented 3 years ago

@lsmith999999 .NET in general depends on the underlying OS to get the globalization behavior. It is Windows which decided to go with this behavior allowing any locale name that conform to BCP-47 standard.

In .NET 5.0 we have exposed the API CultureInfo.GetCultureInfo which can help you creating only cultures that the OS has a real data for.

https://github.com/dotnet/runtime/pull/42496

lsmith999999 commented 3 years ago

@tarekgh: Thanks for the feedback but the situation still remains confusing and the reason for the change bewildering, even if .NET depends on the OS itself. It doesn't have to even if it runs afoul of what the OS is doing, since it's a significant breaking change from earlier behavior going back many years. The change is also potentially very serious in my particular case (and possibly others), and it just quietly started without warning.

Programs don't read (updated) docs before they run, and even now (for us human readers), the docs don't mention that language codes which were were previously rejected as invalid are now considered valid (as if I and others routinely check the docs for unexpected breaking changes like this anyway). The "Notes to Callers" section under the "CultureInfo" constructor gives no clear indication of this whatsoever. It's unclear what the purpose of this change was anyway, even if it occurs at the OS level, and how it's beneficial. From my perspective (and others) it's problematic and confusing (see my previous response).

It also (still) remains unclear how to check for a valid (existing) language code now. Migrating to .NET 5.0 is hardly an option assuming the new "CultureInfo.GetCultureInfo" you mentioned actually works as expected, so it appears I may have to do a language lookup via "CultureInfo.GetCultures()" (the only reliable alternative I can think of off-hand).

Note that I looked at the link you posted (https://github.com/dotnet/runtime/pull/42496) and I'm not comfortable with this approach, which was in a state of flux at the time of posting anyway (and therefore potentially unstable at the time). It was a hack anyway (checking "CultureInfo.EnglishName" for a raw string like "Unknown Language" which had already occurred to me as possible solution anyway, before your post), but even relying on LOCALE_ICONSTRUCTEDLOCALE (the newer approach in the latter link) remains unclear. Where does this originate from in .NET without passing my language string to unmanaged code. For an invalid language like "aaa", the "CultureInfo.LCID" property is returning LOCALE_CUSTOM_UNSPECIFIED (as documented), not LOCALE_ICONSTRUCTEDLOCALE (this is an LCID is it not?). It's not even clear if LOCALE_CUSTOM_UNSPECIFIED can (will) be returned for custom cultures as well (like in earlier versions of Windows), but if it is, then LOCALE_CUSTOM_UNSPECIFIED can't be relied on since you then can't distinguish between a valid custom culture and one that doesn't actually exist.

The situation as it stands is ridiculous but if there's a clear fix then can someone at least document it (clearly).

tarekgh commented 3 years ago

It was a hack anyway (checking "CultureInfo.EnglishName" for a raw string like "Unknown Language" no-one should take dependency on "Unknown Language".

Yes, and that is why we fixed it.

but even relying on LOCALE_ICONSTRUCTEDLOCALE (the newer approach in the latter link) remains unclear.

I am not sure why you think so. it is very clear telling if the returned culture has a real data or not.

We discourage all users of using LCIDs property in general. it is kind of obsolete and you always should rely on the culture names instead. There are many scenarios that LCID wouldn't work.

edited ~Suggesting a change in the docs to clarify things is a reasonable ask, please feel free to log issue in https://github.com/dotnet/dotnet-api-docs/ and we'll be happy to do so.~

this issue is suggesting the change in the doc and we'll be happy to address that.

lsmith999999 commented 3 years ago

@tarekgh

I am not sure why you think so. it is very clear telling if the returned culture has a real data or not.

By "unclear" I meant unclear where to pull LOCALE_ICONSTRUCTEDLOCALE from in .NET (using managed code). You go on to say however:

We discourage all users of using LCIDs property in general. it is kind of obsolete and you always should rely on the culture names instead. There are many scenarios that LCID wouldn't work.

Agreed (that LCIDs are antiquated), but it seems to contradict your previous statement nevertheless, that "... it is very clear telling if the returned culture has a real data or not" (by relying on LOCALE_ICONSTRUCTEDLOCALE). Is the latter intended for your own internal use only (to drive functions like "CultureInfo.GetCultureInfo()"), or is it also intended for client developers. Not likely (you said so yourself), so what am I still misunderstanding. I still don't know what the recommended technique is for determining that "aaa" isn't a recognized language, presumably from BCP 47 normally (though other standards are sometimes mentioned such as RFC 4646, which is what "CultureInfo" is officially based on - I'm not an expert in all these standards though), or user-defined.

I can't rely on exceptions, or LCIDs, so what's the solution (in .NET 4.X)?

tarekgh commented 3 years ago

I am not sure why you are relating LOCALE_ICONSTRUCTEDLOCALE with the LCIDs? LOCALE_ICONSTRUCTEDLOCALE has nothing to do with LCIDs in general.

I can't rely on exceptions, or LCIDs, so what's the solution (in .NET 4.X)?

You can something like:

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    internal static extern unsafe int GetLocaleInfoEx(string lpLocaleName, uint LCType, void* lpLCData, int cchData);

    public static unsafe bool IsPredefinedCulture(string cultureName)
    {
        const uint LOCALE_ICONSTRUCTEDLOCALE   = 0x0000007d;
        const uint LOCALE_RETURN_NUMBER        = 0x20000000;

        int value = 0;
        GetLocaleInfoEx(cultureName, LOCALE_ICONSTRUCTEDLOCALE | LOCALE_RETURN_NUMBER, (char*)&value, sizeof(int));

        return value != 1;
     }
lsmith999999 commented 3 years ago

@tarekgh

_I am not sure why you are relating LOCALE_ICONSTRUCTEDLOCALE with the LCIDs? LOCALEICONSTRUCTEDLOCALE has nothing to do with LCIDs in general

Because while I spend most of the time working in the WinAPI (in C++), I haven't looked (closely) at the Internationalization functions in some years now, and admittedly took the chance you were referring to some pre-defined LCID constant. It sounded that way though I wasn't certain (and didn't have the time confirm it).

Thank you for the function. It's appreciated. I'll take a look but am still unsure at this point. Most .NET developers aren't going to be familiar with this, and wouldn't expect to have to turn to something so low-level (and may not even have the C/C++ background usually required to deal with unsafe code). I'm only working in .NET for one particular (commercial localization) application I wrote many years ago (that I need to tend to periodically), but still find it very surprising I have to rely on something like this when dealing with a managed class like "CultureInfo" (to determine if a language string is valid). What's the point of the documented exceptions it throws if they can't be reliably used in all cases.

I'm very experienced in both the (unmanaged) WinAPI (30+ years) and .NET so can deal with the code without issue, but there's gotta be a cleaner (managed) way (one would think). Most would expect it.

lsmith999999 commented 3 years ago

@tarekgh

Just a follow-up that I tested the function you posted after reading the docs and the behavior is still mysterious. I haven't dug into the details to see what's really going on (the reason for this behavior), but it's effectively identical to "CultureInfo" itself (when passing an unknown culture name to its constructor for instance). Presumably "CultureInfo" is relying on "GetLocaleInfoEx()" (or equivalent) but again, I haven't looked into it.

For instance, the following 4 invalid (unknown) strings (cultureName) generates these results:

cultureName      "GetLocaleInfoEx()" result                                     "CultureInfo" constructor
-----------      --------------------------                                     -------------------------
"w"              Fails. "GetLastError()" returns ERROR_INVALID_PARAMETER        Fails with "CultureNotFoundException"
"wx"             Succeeds. "value" contains 1 ("wx" is a constructed locale)    Succeeds but "CultureInfo.EnglishName" shows "Unknown Language (wx)" and CultureInfo.LCID shows 4096 (LOCALE_CUSTOM_UNSPECIFIED)
"wxy"            Succeeds. "value" contains 1 ("wxy" is a constructed locale)   Succeeds but "CultureInfo.EnglishName" shows "Unknown Language (wxy)" and CultureInfo.LCID shows 4096 (LOCALE_CUSTOM_UNSPECIFIED)
"wxyz"           Fails. "GetLastError()" returns ERROR_INVALID_PARAMETER`       Fails with "CultureNotFoundException"

Presumably you specialize in this particular area or at least have deeper knowledge than myself (haven't looked at this stuff in years), so it may all seem perfectly normal (correct). It seems to confirm much of what's been discussed here but from my perspective (a client developer), it's not normal (expected or reasonable behavior), even if there's a legitimate reason for it. It may be the OS itself but it doesn't matter. At this stage I'll have to rely on a lookup of "cultureName" in "CultureInfo.GetCultures()" which has certain drawbacks of its own (including possible perfomance issues even if I cache "CultureInfo.GetCultures()" in a "HashSet").

Anyway, thanks for your assistance (appreciated).

tarekgh commented 3 years ago

I tweaked the code a little bit here. This code should give you the right result.

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
        internal static extern unsafe int GetLocaleInfoEx(string lpLocaleName, uint LCType, void* lpLCData, int cchData);

        public static unsafe bool IsPredefinedCulture(string cultureName)
        {
            const uint LOCALE_ICONSTRUCTEDLOCALE   = 0x0000007d;
            const uint LOCALE_RETURN_NUMBER        = 0x20000000;

            int value = 0;
            return GetLocaleInfoEx(cultureName, LOCALE_ICONSTRUCTEDLOCALE | LOCALE_RETURN_NUMBER, (char*)&value, sizeof(int)) > 0 && value != 1;
        }
lsmith999999 commented 3 years ago

@tarekgh

Better, thanks. I might have picked up on that myself but LOCALE_ICONSTRUCTEDLOCALE gets into a fuzzy area. Based on my interpretation at this point, it presumably works like this:

  1. "GetLocaleInfoEx()" fails if "cultureName" isn't a well-formed BCP-47 string
  2. "GetLocaleInfoEx()" succeeds if "cultureName" is a well-formed BCP-47 string, but if the culture doesn't actually exist, "value" is set to 1 ("Is a constructed locale"). Otherwise, if it does exist, "value" is set to 0 ("Not constructed").

Is this the final story and if so, is that the (exact) same behavior of "CultureInfo" (for determining a known culture). Presumably the .NET 5.0 version of "CultureInfo.GetCultureInfo()" (with the "predefinedOnly" arg) relies on it as well (based on the earlier link you posted, and I (safely) can too now (via LOCALE_ICONSTRUCTEDLOCALE)?

tarekgh commented 3 years ago

Is this the final story and if so, is that the (exact) same behavior of "CultureInfo" (for determining a known culture). Presumably the .NET 5.0 version of "CultureInfo.GetCultureInfo()" (with the "predefinedOnly" arg) relies on it as well (based on the earlier link you posted, and I (safely) can too now (via LOCALE_ICONSTRUCTEDLOCALE)?

the answers for all questions here is Yes :-)

lsmith999999 commented 3 years ago

Great, thanks for all your help (appreciated!!)

lsmith999999 commented 3 years ago

@tarekgh

Sorry, last word I hope (no need to respond unless req'd). I updated (corrected) bullet 2 in my last response in case anyone else relies on it. I originally reversed the meaning of 0 and 1 due to the confusing interpretation of "Is a constructed locale" vs "Not constructed" in the LOCALE_ICONSTRUCTEDLOCALE docs ("constructed" meaning the culture actually doesn't exist, not constructed meaning it does - easy to get this backwards and I did - mea culpa)

tarekgh commented 3 years ago

I suggest instead of saying constructed, you say culture which the underlying OS may not have a complete information. This could be more explicit and easier to understand.

lsmith999999 commented 3 years ago

Yes, I actually do get it, but there's a natural tendency to think that something which is "constructed" exists and something that's not constructed doesn't (just from the raw meaning of the word "constructed" coupled with the word "exists"). It's was just my brain on auto-pilot as it processed those two words in normal everyday use. In this case things work in reverse though (or rather you have to interpret it in a different context) but as a programmer of course you have to just look at the actual (documented) codes you're dealing with to overcome what your brain is thinking :)

Anyway, I think we're finally done. Thanks again!

ptr727 commented 1 year ago

Just found this issue, my code expected to throw when the string in the constructure is not valid.

The docs further state that CreateSpecificCulture() is just a wrapper, but in my testing on Win11 and dotnet 7.0.3 the returned object is distinctly different.

If the behavior is expected to be that "foo" "bar" "etc" will succeed simply because it looks like it could be ok, how do I test the attributes of the returned CI to detect if the culture was created on the fly, vs. actually exists?

tarekgh commented 1 year ago

how do I test the attributes of the returned CI to detect if the culture was created on the fly, vs. actually exists?

Use GetCultureInfo(String, Boolean) and pass true for the predefinedOnly parameter.

ptr727 commented 1 year ago

I ended up there via StackOverflow, and it seems to mostly work mostly, but for "und" it succeeds but returns a "ZZZ" for ThreeLetterWindowsLanguageName and UserCustomCulture, so I still need to add validation logic.

I know ISO639-3 "und" may be a special case, as it could translate to "UND" in RFC 5646, but I'd still like to know it is not valid, and let my logic test for "und" or "zxx", etc.

tarekgh commented 1 year ago

The solution I mentioned here https://github.com/dotnet/dotnet-api-docs/issues/3110#issuecomment-1470120329 is more reliable than what is suggested in the stack overflow. Why can't you use that?

ptr727 commented 1 year ago

I wasn't clear, I did end up using GetcultureInfo(language, true), but it does not work as expected for "und" and "zxx", so I still either have to hard code special codes, and I don't know them all, or I generically test for "ZZZ" and UserCustomCulture.