dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Find out exact ordering of hiragana/katakana letters in native apple platforms #86636

Open mkhamoyan opened 1 year ago

mkhamoyan commented 1 year ago

While working on https://github.com/dotnet/runtime/pull/85965 we found that the ordering of hiragana/katakana letters on native Apple platforms is not so clear. There are 3 cases:

  1. Letters that have a small equivalent. For this case the ordering is hiragana small letter < katakana small letter < hiragana letter < katakana letter.

    code    char  name
    \u3041  ぁ    Hiragana letter small A
    \u3042  あ    Hiragana letter A
    \u30A1  ァ    Katakana letter small A
    \u30A2  ア    Katakana letter A

  2. Letters without a small equivalent. For this case the ordering is katakana letter < hiragana letter, but it is not clear whether these sort after the small katakana letters or somewhere else.

    code    char  name
    \u30C0  ダ    Katakana letter DA
    \u3060  だ    Hiragana letter DA

  3. Letters that exist only in katakana. It is not clear where these letters fall in the ordering.

    code    char  name
    \u30F4  ヴ    Katakana letter VU
Find out the exact ordering rules, update hybrid-globalization.md with the CompareString details for the OSX platform, and add more test cases showing the ordering.

Contributes to https://github.com/dotnet/runtime/issues/80689

ghost commented 1 year ago

Tagging subscribers to 'os-ios': @steveisok, @akoeplinger
See info in area-owners.md if you want to be subscribed.

Issue Details

While working on https://github.com/dotnet/runtime/pull/85965 we found that the ordering of hiragana/katakana letters on native Apple platforms is not so clear. There are 3 cases:

1. Letters that have lowercase/uppercase
For this case the ordering is `hiragana lowercase < katakana lowercase < hiragana uppercase < katakana uppercase`

| code | char | name |
| -- | -- | -- |
| \u3041 | ぁ | Hiragana letter small A |
| \u3042 | あ | Hiragana letter A |
| \u30A1 | ァ | Katakana letter small A |
| \u30A2 | ア | Katakana letter A |

2. Letters without lowercase/uppercase
For this case the ordering is `katakana letter < hiragana letter`, but it is not clear whether these sort after the lowercase katakana letters or somewhere else.

| code | UTF-8 | char | name |
| -- | -- | -- | -- |
| \u30C0 | E38380 | ダ | Katakana letter DA |
| \u3060 | E381A0 | だ | Hiragana letter DA |

3. Letters only existing in katakana
It is not clear where these letters fall in the ordering.

| code | char | name |
| -- | -- | -- |
| \u30F4 | ヴ | Katakana letter VU |

Find out the exact ordering rules, update `hybrid-globalization.md` with the `CompareString` details for the `OSX` platform, and add more test cases showing the ordering.

Contributes to https://github.com/dotnet/runtime/issues/80689

Author: mkhamoyan
Assignees: mkhamoyan
Labels: `area-System.Globalization`, `os-ios`, `os-tvos`, `os-maccatalyst`
Milestone: -

Clockwork-Muse commented 1 year ago

Letters that have lowercase/uppercase

There's no such thing as "case" (in the English/Latin sense; I guess they might still be designated that way in Unicode, but I somehow doubt it) for kana. Small characters are normally used to modify the sounds of "normal sized" characters. For instance:

きよ (ki yo) -> きょ (kyo)

You're not supposed to have small characters on their own - formally that doesn't make any sense.

Letters without lowercase/uppercase

I'm not sure why the ordering of hiragana/katakana is stated as flipped here, unless it's something specific to the original test data. Looking at the Unicode blocks, though, there's a complete match, so it should be possible to do this via offset from the start of the block (assuming an interleaved/phonetic ordering, rather than just two separate blocks).
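As a rough illustration of the offset idea, the Python sketch below builds an interleaved key from a letter's position inside its block. It relies on the paired ranges U+3041..U+3096 (hiragana) and U+30A1..U+30F6 (katakana), which sit 0x60 apart. The name `interleaved_key` and the hiragana-first tie-break are assumptions made for the example, not something any platform is known to implement.

```python
HIRAGANA_START = 0x3041  # HIRAGANA LETTER SMALL A
KATAKANA_START = 0x30A1  # KATAKANA LETTER SMALL A
PAIRED_LEN = 0x56        # U+3041..U+3096 pairs with U+30A1..U+30F6


def interleaved_key(ch: str) -> tuple[int, int]:
    """Hypothetical key: offset within the block first, then the script."""
    cp = ord(ch)
    if HIRAGANA_START <= cp < HIRAGANA_START + PAIRED_LEN:
        return (cp - HIRAGANA_START, 0)  # assumption: hiragana wins ties
    if KATAKANA_START <= cp < KATAKANA_START + PAIRED_LEN:
        return (cp - KATAKANA_START, 1)
    raise ValueError(f"U+{cp:04X} is outside the paired kana ranges")


letters = ["\u30C0", "\u3060", "\u30A2", "\u3042", "\u30F4"]
interleaved = sorted(letters, key=interleaved_key)
print(" < ".join(interleaved))
```

Flipping the second element of the key gives the katakana-first variant, so the same sketch can express either tie-break once the native behaviour is known.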

Note that the two characters chosen as an example have the (ten-ten) marks, which change the sound of the characters, as part of the character, but there are also separate and combining character versions of the marks. That is, there's both た followed by the combining mark U+3099 and the precomposed だ (U+3060), which are separate Unicode sequences!
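A small Python check of that point, using only the standard `unicodedata` module: the precomposed letter and the base-plus-combining-mark sequence are different strings until they are normalized. This only shows the Unicode-level equivalence; whether a given platform comparer treats the two as equal is a separate question.

```python
import unicodedata

precomposed = "\u3060"       # HIRAGANA LETTER DA
decomposed = "\u305F\u3099"  # HIRAGANA LETTER TA + combining voiced sound mark

# Different code point sequences, so an ordinal comparison says "not equal".
raw_equal = precomposed == decomposed

# NFC composes the pair, NFD splits the precomposed letter.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)
```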

Letters only existing in katakana

The Unicode block actually shows an equivalent entry for a hiragana character, ゔ (U+3094). There are some additional entries or other differences between the two blocks (the ten-ten marks are part of the hiragana block, for example).

I don't actually recognize everything in the blocks - they aren't in the commonly taught syllabary (at minimum when learning Japanese as a foreign language, but I don't think in Japanese schools either). I'm not sure of some of the uses of some of the characters.

Also, as an additional wrinkle, for historical reasons there's a half-width katakana block. Note this block does not include the pre-combined versions of characters, and has an additional set of (half-width) combining characters.
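To illustrate the half-width case, here is a short Python sketch using compatibility normalization (NFKC) from the standard library. It shows how the half-width letter plus its half-width voiced mark relate to the full-width precomposed letter at the Unicode level; it is not a statement about what ICU or the Apple comparer does with width.

```python
import unicodedata

halfwidth_da = "\uFF80\uFF9E"  # HALFWIDTH KATAKANA LETTER TA + halfwidth voiced sound mark
fullwidth_da = "\u30C0"        # KATAKANA LETTER DA

# NFKC maps the half-width forms to full-width ones and then composes them.
folded = unicodedata.normalize("NFKC", halfwidth_da)

print(len(halfwidth_da), len(folded), folded == fullwidth_da)
```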

mkhamoyan commented 1 year ago

Thanks for the explanation. My bad for using the words lowercase/uppercase; I changed them to small letter.

I wanted to give examples where the ICU compareString and Apple native compareString functions return different results when comparing hiragana and katakana letters. We had test cases that expect, for example, \u30C0 ダ to be before \u3060 だ when using ICU, but the Apple native compareString function returns a different ordering.

It is known that with Windows NLS hiragana characters sort after katakana, while with ICU it is the opposite. This task was created to investigate how hiragana and katakana characters are sorted on Apple platforms.

tarekgh commented 11 months ago

@mkhamoyan @steveisok could you please triage this issue? Thanks!