dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.12k stars 4.7k forks source link

Compare string with IgnoreCase/Kana in ja-JP character failed in unix #15952

Closed kkurni closed 4 years ago

kkurni commented 8 years ago

There seems to be a bug when ignoring Kana Character in ja-JP locale.

Repro

private static CultureInfo[] _cultureInfo =
        {
            new CultureInfo("ar-SA"),  // Arabic - Saudi Arabia
            new CultureInfo("ja-JP"),  // Japanese - Japan
            new CultureInfo("de-DE"),  // German - Germany
            new CultureInfo("hi-IN"),  // Hindi - India
            new CultureInfo("tr-TR"),  // Turkish - Turkey
            new CultureInfo("th-TH"),  // Thai - Thailand
            new CultureInfo("el-GR"),  // Greek - Greece
            new CultureInfo("ru-RU"),  // Russian - Russia
            new CultureInfo("he-IL"),  // Hebrew - Israel
            new CultureInfo("cs-CZ"),  // Czech - Czech Republic
            new CultureInfo("fr-CH"),  // French - Switzerland
            new CultureInfo("en-US")   // English - United States
        };

 private static string[,] _specialMatchingString = new string[4, 2] {{"Lorem ipsum dolor sit amet", "Lorem ipsum dolor sit amet"},
                                                                         {"かたかな", "カタカナ"},
                                                                         {"ファズ・ギター", "ファズ・ギター"},
                                                                         {"engine", "eNGine"}};

 [Fact]
        public void CultureComparisonTests()
        {
            foreach (var cultureInfo in _cultureInfo)
            {
                for (var testNo = 0; testNo < _specialMatchingString.GetLength(0); testNo++)
                {
                    CultureComparisonTests(cultureInfo.Name, _specialMatchingString[testNo, 0], _specialMatchingString[testNo, 1]);
                }
            }
        }

        public void CultureComparisonTests(string localeName, string str1, string str2)
        {
            Console.WriteLine("CultureComparisonTests for {0} : {1} : {2}", localeName, str1, str2);
            var options = CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth;

            var comparisonInfo = new CultureInfo(localeName).CompareInfo;
            var result = comparisonInfo.Compare(str1, 0, str1.Length, str2, 0, str2.Length, options);

            Assert.True(result == 0, String.Format("{0} : {1} comparison should ignore case, kana type and width", str1, str2));
        }

Summary

Test this in UNIX

CultureComparisonTests("ja-JP", "ファズ・ギター" , "ファズ・ギター");

Expected to be true for comparison with ignore Kana and ignore case character

This only failed in

kkurni commented 8 years ago

CC @saurabh500 @corivera

ellismg commented 8 years ago

/cc @eerhardt. Eric just recently landed changes to improve our support for these flags, so it would be good to understand exactly what CoreCLR build you were testing with.

kkurni commented 8 years ago

I built using linux and run it on ubuntu14.0.4 msbuild /p:OSGroup=Linux

ellismg commented 8 years ago

Do you also build and copy over a custom coreclr version or do you just use whatever comes by default?

kkurni commented 8 years ago

I built this on Windows and copy the whole dnxcore50 to unix and run it in Unix.

and I didn't update coreclr for a while.

Does the fix is in CoreCLR ?

ellismg commented 8 years ago

Yes, the fix is in CoreCLR, not CoreFX. We will need to update the test runtime we use.

kkurni commented 8 years ago

I just update my coreCLR and built it again, but still getting the same issue.

eerhardt commented 8 years ago

I was able to repro this issue with the latest CoreCLR as well.

It appears the problem is coming from the last character in that line. What appears to be a dash.

Test this in UNIX fails

CultureComparisonTests("ja-JP", "ファズ・ギター" , "ファズ・ギター");

But this test in UNIX passes

CultureComparisonTests("ja-JP", "ファズ・ギタ" , "ファズ・ギタ");
kkurni commented 8 years ago

Good finding.

eerhardt commented 8 years ago

I've reproed this in a standalone native application on my Mac OSX El Capitan machine (ICU 55).

void CompareStrings(const char * localeName)
{
    UErrorCode err;
    UCollator *pCol = ucol_open(localeName, &err);

    ucol_setAttribute(pCol, UCOL_STRENGTH, UCOL_SECONDARY, &err);

    const UChar halfWidth[] = { 0xff80, 0xff70 };  // "ター"
    const UChar fullWidth[] = { 0x30bf, 0x30fc };  // "ター"

    int result = ucol_strcoll(pCol, halfWidth, 2, fullWidth, 2);

    printf("%d\n", result);

    ucol_close(pCol);
}

int main(int argc, const char * argv[])
{
    CompareStrings("en");
    CompareStrings("ja");
    CompareStrings("zh");
    CompareStrings("ko");
    return 0;
}

that code prints out:

0
-1
0
0

I'll be sending an email to ICU discussion on this issue, but I'm assuming this issue is going to get closed as "external" since this is the underlying ICU behavior.

kkurni commented 8 years ago

I reproed this on ubuntu 14.0.4. Does ubuntu use the same ICU 55 ? will it be the same for all unix ?

eerhardt commented 8 years ago

My Ubuntu 14.04 was using ICU 52, which is even older.

eerhardt commented 8 years ago

@markusicu replied on the ICU support email discussion:

This is the issue: halfwidth vs. fullwidth forms. The Japanese sort order has special rules for length-mark-after-syllable, but only for the regular length mark, not for its halfwidth form. It also does not seem to have a complete duplicate of the rules for the halfwidth syllable (the Ta) compared to its regular form. The data is in CLDR: http://unicode.org/cldr/trac/browser/trunk/common/collation/ja.xml It is curious that we have had this sort order for some twelve years but no one seems to have noticed or cared... If you want the halfwidth forms to be treated like the regular forms, then please submit a ticket at http://unicode.org/cldr/trac/newticket

I'm going to close this issue, since this needs to be fixed in the underlying CLDR data. Once that happens, if someone is using an ICU version with the fixed data, this bug will be fixed.

stephentoub commented 8 years ago

Thanks, @eerhardt. I assume you opened a ticket per the response?

eerhardt commented 8 years ago

Yes - I opened http://unicode.org/cldr/trac/ticket/9146 for the underlying CLDR data problem.