Find out exact ordering of hiragana/katakana letters in native apple platforms

mkhamoyan commented 1 year ago

While working on https://github.com/dotnet/runtime/pull/85965 we found that the ordering of hiragana/katakana letters on native Apple platforms is not clear. There are 3 cases:

Letters that have small equivalent. For this case ordering works like `hiragana small letter < katakana small letter < hiragana letter < katakana letter`.

code	char	name
\u3041	ぁ	Hiragana letter small A
\u3042	あ	Hiragana letter A
\u30A1	ァ	Katakana letter small A
\u30A2	ア	Katakana letter A
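
As a minimal sketch, the ordering reported for this case can be modelled with a two-part sort key: size first, then script. This only reproduces the order stated above for these four characters; it is not Apple's comparison algorithm, and the helper name `observed_key` is hypothetical.

```python
import unicodedata

def observed_key(ch):
    """Hypothetical key modelling the reported order:
    small letters first, and hiragana before katakana within each size."""
    name = unicodedata.name(ch)          # e.g. "HIRAGANA LETTER SMALL A"
    is_full_size = " SMALL " not in name
    is_katakana = name.startswith("KATAKANA")
    return (is_full_size, is_katakana)

# The four characters from the table, deliberately shuffled.
samples = ["\u3042", "\u30A2", "\u3041", "\u30A1"]
ordered = sorted(samples, key=observed_key)
print(" < ".join(ordered))
```

Whether the same key holds for letters outside this table is exactly what the issue asks to find out.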

Letters without a small equivalent. For this case the ordering is `katakana letter < hiragana letter`, but it is not clear whether these letters come after the small katakana letters or somewhere else.

code	char	name
\u30C0	ダ	Katakana letter DA
\u3060	だ	Hiragana letter DA

Letters that exist only in katakana. It is not clear where these letters fall in the ordering.

code	char	name
\u30F4	ヴ	Katakana letter VU

Find out the exact ordering rules, update the `CompareString` details for the `OSX` platform in `hybrid-globalization.md`, and add more test cases showing the ordering.
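
One way to collect such test cases is a small probe that sorts a fixed sample set with whatever comparison function is under investigation and records the result. The sketch below is only an illustration: `ordering_report` and `ordinal_compare` are hypothetical names, and the ordinal comparator is a stand-in for the native comparison, which is not called here.

```python
from functools import cmp_to_key

def ordinal_compare(a, b):
    """Stand-in comparator: plain code point order, returning -1, 0 or 1.
    Replace with a binding to the comparison under test."""
    return (a > b) - (a < b)

def ordering_report(chars, compare):
    """Sort chars with the given comparator and label each with its code point."""
    ordered = sorted(chars, key=cmp_to_key(compare))
    return [f"U+{ord(c):04X} {c}" for c in ordered]

# The characters mentioned in the three cases above.
SAMPLES = ["\u3041", "\u3042", "\u30A1", "\u30A2", "\u3060", "\u30C0", "\u30F4"]

for line in ordering_report(SAMPLES, ordinal_compare):
    print(line)
```

Because `sorted` is stable, characters that the comparator treats as equal keep their input order, so ties would need a separate check.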

Contributes to https://github.com/dotnet/runtime/issues/80689

ghost commented 1 year ago

Tagging subscribers to 'os-ios': @steveisok, @akoeplinger See info in area-owners.md if you want to be subscribed.

Issue Details

While working on https://github.com/dotnet/runtime/pull/85965 we find out that ordering of hiragana/katakana letters in native apple platforms is not so clear. There are 3 cases

1. Letters that have lowercase/uppercase. For this case ordering works like `hiragana lowercase < katakana lowercase < hiragana uppercase < katakana uppercase`

code	char	name
\u3041	ぁ	Hiragana letter small A
\u3042	あ	Hiragana letter A
\u30A1	ァ	Katakana letter small A
\u30A2	ア	Katakana letter A

2. Letters without lowercase/uppercase. For this case ordering is `katakana letter < hiragana letter` but not sure it comes after lowercase katakana letters or somewhere else.

code	UTF-8 bytes	char	name
\u30C0	E38380	ダ	Katakana letter DA
\u3060	E381A0	だ	Hiragana letter DA

3. Letters only existing in katakana. Not sure where these letters are in ordering.

code	char	name
\u30F4	ヴ	Katakana letter VU

Find out what is the exact flow of ordering and update `hybrid-globalization.md` for `OSX` platform `CompareString` function details and add more test cases showing the ordering.

Contributes to https://github.com/dotnet/runtime/issues/80689

Author:	mkhamoyan
Assignees:	mkhamoyan
Labels:	`area-System.Globalization`, `os-ios`, `os-tvos`, `os-maccatalyst`
Milestone:	-

Clockwork-Muse commented 1 year ago

Letters that have lowercase/uppercase

There's no such thing as "case" (in the English/Latin sense; I guess they might still be designated that way in Unicode, but I somehow doubt it) for kana. Small characters are normally used to modify the sounds of "normal-sized" characters. For instance:

きよ　(ki yo) -> きょ (kyo)

You're not supposed to have small characters on their own - formally that doesn't make any sense.

Letters without lowercase/uppercase

I'm not sure why you're stating the ordering of hiragana/katakana flipped here, unless it's something specific to the original test data? Looking at the Unicode blocks, though, there's a complete match, so it should be possible to do this via the offset from the start of each block (assuming an interleaved/phonetic ordering, rather than just two separate blocks).
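
A rough sketch of that offset idea: the letters U+3041..U+3096 in the Hiragana block line up one-to-one with U+30A1..U+30F6 in the Katakana block, so the offset from the first letter can act as a phonetic position. The helper name `interleaved_key` and the hiragana-first tie-break are assumptions made for illustration, not behaviour confirmed for any platform.

```python
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6

def interleaved_key(ch):
    """Return (offset within block, script), so matching letters sit together.
    Script 0 = hiragana, 1 = katakana (tie-break chosen arbitrarily here)."""
    cp = ord(ch)
    if HIRAGANA_FIRST <= cp <= HIRAGANA_LAST:
        return (cp - HIRAGANA_FIRST, 0)
    if KATAKANA_FIRST <= cp <= KATAKANA_LAST:
        return (cp - KATAKANA_FIRST, 1)
    raise ValueError(f"U+{cp:04X} is outside the paired kana ranges")

ordered = sorted(["\u30C0", "\u3042", "\u3060", "\u30A2"], key=interleaved_key)
print(" < ".join(ordered))
```

Characters outside the paired ranges, such as the katakana-only letters after U+30F6, are rejected rather than guessed at.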

Note that the two characters chosen as an example have the ゛ (ten-ten) marks, which change the sound of the characters, as part of the character, but there are also separate and combining versions of the mark. That is, there's both だ (precomposed, U+3060) and だ (た followed by the combining mark U+3099), which are separate Unicode sequences!
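
The difference between the two sequences can be shown with Unicode normalization. This is a small illustration using Python's standard `unicodedata` module; it says nothing about how any particular comparison function treats the two forms.

```python
import unicodedata

precomposed = "\u3060"                                  # HIRAGANA LETTER DA
decomposed = unicodedata.normalize("NFD", precomposed)  # TA + combining mark

# The two strings render alike but are different code point sequences.
print([f"U+{ord(c):04X}" for c in precomposed])
print([f"U+{ord(c):04X}" for c in decomposed])
print(precomposed == decomposed)

# NFC recomposes the pair back into the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)
```

Any ordering test that compares raw strings would need to decide which of the two forms it feeds in.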

Letters only existing in katakana

The Unicode block actually shows an equivalent entry for a hiragana character, ゔ. There are some additional entries or other differences between the two blocks (the ten-ten marks are part of the hiragana block, for example).

I don't actually recognize everything in the blocks - they aren't in the commonly taught syllabary (at minimum when learning Japanese as a foreign language, but I don't think in Japanese schools either). I'm not sure how some of the characters are used.

Also, as an additional wrinkle, for historical reasons there's a half-width katakana block. Note this block does not include the pre-combined versions of characters, and has an additional set of (half-width) combining characters.
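
As an illustration of that wrinkle, compatibility normalization (NFKC) folds half-width katakana into the full-width block, including a half-width letter followed by the half-width voiced sound mark. This is a sketch with Python's standard `unicodedata` module; whether a given platform comparison performs such folding is not established in this thread.

```python
import unicodedata

halfwidth_a = "\uFF71"         # HALFWIDTH KATAKANA LETTER A
halfwidth_da = "\uFF80\uFF9E"  # half-width TA + half-width voiced sound mark

fullwidth_a = unicodedata.normalize("NFKC", halfwidth_a)
fullwidth_da = unicodedata.normalize("NFKC", halfwidth_da)

print(f"U+{ord(fullwidth_a):04X}", fullwidth_a)
print([f"U+{ord(c):04X}" for c in fullwidth_da], fullwidth_da)
```

The two-character half-width sequence collapses to the single precomposed katakana letter, so string lengths change under this normalization.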

mkhamoyan commented 1 year ago

Thanks for the explanation. My bad for using the words lowercase/uppercase; I have changed them to "small letter".

I wanted to give examples where the ICU and Apple native compareString functions return different results when comparing hiragana and katakana letters. We had test cases that expect, for example, \u30C0 ダ to sort before \u3060 だ when using ICU, but the Apple native compareString function returns a different ordering.

It is known that with Windows NLS, hiragana characters sort after katakana; with ICU it is the opposite. This task was created to investigate how hiragana and katakana characters are sorted on Apple platforms.

tarekgh commented 11 months ago

@mkhamoyan @steveisok could you please triage this issue? Thanks!

Find out exact ordering of hiragana/katakana letters in native apple platforms #86636