dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.18k stars 4.72k forks source link

es-ES Culture (and possibly others) Incorrectly (I think?) format AMDesignator/PMDesignator #77585

Closed SeanKilleen closed 1 year ago

SeanKilleen commented 1 year ago

Description

As part of a client project using .NET 6, I am displaying a date/time in the es-ES locale.

I ran into an issue where a unit test around the time formatting failed, but the two strings looked exactly alike. So I started digging and I think it's possible there may be a small issue (or it may be by design).

Reproduction Steps

In a .NET 6 with xUnit and FluentAssertions:

See the formatting issue:

[Fact]
public void TimeShowsUpCorrectlyForES()
{
    var cultureInfo = CultureInfo.GetCultureInfo("es-ES");
    var testDate = new DateTime(2022, 10, 27, 12, 0, 0);

    var formattedString = testDate.ToString("tt", cultureInfo);

    // This will fail, even though the string appears to the eye to match.
    formattedString.Should().Be("p. m."); // I typed "p. m." by hand there, as I would in a unit test
}

So then I figured I'd check the character:

[Fact]
public void PMUsesAscii32()
{
    var cultureInfo = CultureInfo.GetCultureInfo("es-ES");
    var testDate = new DateTime(2022, 10, 27, 12, 0, 0);

    var formattedString = testDate.ToString("tt", cultureInfo);

    var spaceCharInString = formattedString[2];
    var spaceCharAsInt = (int)spaceCharInString;

    // Fails -- Result is 160, an ASCII Non-breaking space
    spaceCharAsInt.Should().Be(32);
}

I know as of .NET 5 (IIRC), standardization was moved to ICU.

So I looked at the ICU Locale file: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/es.txt

And I copied / pasted those values into InlineData in another test:

[Theory]
[InlineData("GregorianAm", "a. m.")]
[InlineData("GregorianPm", "p. m.")]
[InlineData("MarkersAm", "a. m.")]
[InlineData("MarkersPm", "p. m.")]
[InlineData("NarrowAm", "a. m.")]
[InlineData("NarrowPm", "p. m.")]
[InlineData("StandaloneAbbrevAM", "a. m.")]
[InlineData("StandaloneAbbrevPm", "p. m.")]
[InlineData("NarrowAM", "a. m.")]
[InlineData("NarrowPM", "p. m.")]
[InlineData("WideAM", "a. m.")]
[InlineData("WidePM", "a. m.")]
[InlineData("DayPeriodAM", "a. m.")]
[InlineData("DayPeriodPM", "p. m.")]
[InlineData("DayPeriodNarrowAM", "a. m.")]
[InlineData("DayPeriodNarrowPM", "p. m.")]
[InlineData("DayPeriodShortAM", "a. m.")]
[InlineData("DayPeriodShortPM", "p. m.")]
public void doesUnicodeIcuUseANonBreakingSpace(string caseNameForDisplay, string stringToCheck)
{
    var testChar = stringToCheck[2];
    var asInt = (int)testChar;
    asInt.Should().Be(32);
    testChar.Should().Be(' ');
}

Some of those tests fail, reporting that a character code of 8239 -- a unicode non-breaking space is being used.

image

I think this maps to the AMDesignator and PMDesignator fields -- these tests will fail:

[Fact]
public void CultureInfo_UsesTheWrongThingForAM()
{
    var cultureInfo = CultureInfo.GetCultureInfo("es-ES");
    cultureInfo.DateTimeFormat.AMDesignator.Should().Be("a. m.");
}

[Fact]
public void CultureInfo_UsesTheWrongThingForPM()
{
    var cultureInfo = CultureInfo.GetCultureInfo("es-ES");
    cultureInfo.DateTimeFormat.PMDesignator.Should().Be("p. m.");
}

Theory

Given the test results and the likelihood of mapping, my current theory is that:

Expected behavior

DateTime.ToString("tt") for a CultureInfo of es-ES will map to a. m. or p. m. (with an ASCII space).

Actual behavior

DateTime.ToString("tt") for a CultureInfo of es-ES outputs a. m. or p. m. using an ASCII non-breaking space for the space after the first .

Regression?

I'm not sure.

Known Workarounds

Configuration

Other information

No response

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

Issue Details
### Description As part of a client project using .NET 6, I am displaying a date/time in the `es-ES` locale. I ran into an issue where a unit test around the time formatting failed, but the two strings looked exactly alike. So I started digging and I think it's possible there may be a small issue (or it may be by design). ### Reproduction Steps In a .NET 6 with xUnit and FluentAssertions: See the formatting issue: ```csharp [Fact] public void TimeShowsUpCorrectlyForES() { var cultureInfo = CultureInfo.GetCultureInfo("es-ES"); var testDate = new DateTime(2022, 10, 27, 12, 0, 0); var formattedString = testDate.ToString("tt", cultureInfo); // This will fail, even though the string appears to the eye to match. formattedString.Should().Be("p. m."); // I typed "p. m." by hand there, as I would in a unit test } ``` So then I figured I'd check the character: ```csharp [Fact] public void PMUsesAscii32() { var cultureInfo = CultureInfo.GetCultureInfo("es-ES"); var testDate = new DateTime(2022, 10, 27, 12, 0, 0); var formattedString = testDate.ToString("tt", cultureInfo); var spaceCharInString = formattedString[2]; var spaceCharAsInt = (int)spaceCharInString; // Fails -- Result is 160, an ASCII Non-breaking space spaceCharAsInt.Should().Be(32); } ``` I know as of .NET 5 (IIRC), standardization was moved to ICU. So I looked at the ICU Locale file: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/es.txt And I copied / pasted those values into `InlineData` in another test: ```csharp [Theory] [InlineData("GregorianAm", "a. m.")] [InlineData("GregorianPm", "p. m.")] [InlineData("MarkersAm", "a. m.")] [InlineData("MarkersPm", "p. m.")] [InlineData("NarrowAm", "a. m.")] [InlineData("NarrowPm", "p. m.")] [InlineData("StandaloneAbbrevAM", "a. m.")] [InlineData("StandaloneAbbrevPm", "p. m.")] [InlineData("NarrowAM", "a. m.")] [InlineData("NarrowPM", "p. m.")] [InlineData("WideAM", "a. m.")] [InlineData("WidePM", "a. m.")] [InlineData("DayPeriodAM", "a. m.")] [InlineData("DayPeriodPM", "p. m.")] [InlineData("DayPeriodNarrowAM", "a. m.")] [InlineData("DayPeriodNarrowPM", "p. m.")] [InlineData("DayPeriodShortAM", "a. m.")] [InlineData("DayPeriodShortPM", "p. m.")] public void doesUnicodeIcuUseANonBreakingSpace(string caseNameForDisplay, string stringToCheck) { var testChar = stringToCheck[2]; var asInt = (int)testChar; asInt.Should().Be(32); testChar.Should().Be(' '); } ``` Some of those tests fail, reporting that a character code of 8239 -- a unicode non-breaking space is being used. > ![image](https://user-images.githubusercontent.com/2148318/198484771-3866d418-5dc5-421b-aa29-4a834f58f9c4.png) I think this maps to the AMDesignator and PMDesignator fields -- these tests will fail: ```csharp [Fact] public void CultureInfo_UsesTheWrongThingForAM() { var cultureInfo = CultureInfo.GetCultureInfo("es-ES"); cultureInfo.DateTimeFormat.AMDesignator.Should().Be("a. m."); } [Fact] public void CultureInfo_UsesTheWrongThingForPM() { var cultureInfo = CultureInfo.GetCultureInfo("es-ES"); cultureInfo.DateTimeFormat.PMDesignator.Should().Be("p. m."); } ``` ## Theory Given the test results and the likelihood of mapping, my current theory is that: * The ICU formatting is being used, e.g. https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/es.txt#L945 * The "standalone abbreviated" section is being used, e.g. mapped into `AMDesignator` and `PMDesignator` fields. * ❓ This is something I'm unsure of, as I can't figure out where that mapping is actually happening * There's a unicode non-breaking space in that * Somewhere along the line, it's mapped to an ASCII non-breaking space * That makes its away into the `AMDesignator` / `PMDesignator` aspects of `CultureInfo` * I see the head-scratching result when asserting against a date format of `tt` for that culture. ### Expected behavior `DateTime.Format("tt")` for a `CultureInfo` of `es-ES` will map to `a. m.` or `p. m.` (with an ASCII space). ### Actual behavior `DateTime.Format("tt")` for a `CultureInfo` of `es-ES` outputs `a. m.` or `p. m.` using an ASCII non-breaking space for the space after the first `.` ### Regression? I'm not sure. ### Known Workarounds * Replace the non-breaking space with a standard space character prior to return, after hitting this issue. * Update the assertion in the unit test by copying the actual result into the expectation, including the non-breaking space. ### Configuration * .NET 6: * Windows 11 (though I think it's application to any version after ICU globalization formats were introduced) ### Other information _No response_
Author: SeanKilleen
Assignees: -
Labels: `area-System.Globalization`
Milestone: -
Clockwork-Muse commented 1 year ago

Setting aside the culture data question, your actual problem is this:

I ran into an issue where a unit test around the time formatting failed

You shouldn't be writing tests with dependencies on data that

  1. You don't control
  2. Are part of the machine environment (There's various ways to provide customized ICU data, especially on Linux)
  3. Can be changed at any time (and Windows has a different release cadence of ICU updates than Linux)

Generally testing output falls into one of two camps:

  1. Ensure non-empty output for tests against a "live"/environment-provided culture
  2. Test exact output against a controlled, static, culture to ensure that the the expected formatting element is being used. The goal being more to ensure that a particular field is using ShortTimePattern, not that it formats it as 4AM or whatever. This can either be done with CultureInfo.Invariant, or by constructing a testing-specific custom culture (Microsoft actually uses this to ensure that UI elements even have localization set up).
SeanKilleen commented 1 year ago

@Clockwork-Muse typically I'd agree but in this case, this very specific formatting is required for an output for users regardless of what machine they or we are operating on. It goes into an email for users. I happened to be writing this code and these tests specifically because another developer had gotten the formatting wrong, so I assure you it's a valid test case.

My client is an international organization that in some cases needs things to default to a specific culture. One of the reasons we have this test is to ensure the output will be as we expect on whatever environment we run the tests on. it's not foolproof, but it's a much higher degree of confidence.

Please note that my tests accomplished exactly what they needed to, with the exception of discovering the non-breaking space. All the test cases in this issue are simplified for reproduction purposes.

So I'd like not to set aside the culture data question, but rather focus on it. 👍

Seems the possible outcomes here will be one of the following:

Regardless, I hope at the end of this I'll understand:

tarekgh commented 1 year ago

Getting NBSP inside the AM/PM designators is what CLDR decided to do. You may look at https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/CLDR-11469?jql=project%20%3D%20%22CLDR%22%20AND%20text%20~%20%22NBSP%22%20ORDER%20BY%20created%20DESC for that.

.NET not directly mapping to the CLDR but this mapping is done on the level of the ICU native library APIs. In .NET we are calling the correct API to get the needed data as show here. UDAT_AM_PMS.

In summary, .NET is working as expected and the data returned is the expected data according to CLDR. I am closing the issue but feel free to send any question you think we can help with. Thanks for the report.

SeanKilleen commented 1 year ago

Thanks for clarifying, @tarekgh, and pointing me to the supporting information! Much appreciated. 👍

Clockwork-Muse commented 1 year ago

this very specific formatting is required for an output for users regardless of what machine they or we are operating on. ... One of the reasons we have this test is to ensure the output will be as we expect on whatever environment we run the tests on.

If it's that specific, you should be constructing a permanent, static, culture with the specific formatting requirements, not pulling from environment data.

My client is an international organization that in some cases needs things to default to a specific culture.

This is a separate concern, unrelated to the specific formatting observed. Some languages/regions more tightly control formatting, and if you want your application to update to have the new data, then you should only be testing that the expected default culture is passed (eg, comparing the culture id), not what the formatting output it provides is.