Open mkhamoyan opened 1 year ago
Tagging subscribers to 'os-ios': @steveisok, @akoeplinger See info in area-owners.md if you want to be subscribed.
| Author: | mkhamoyan |
|---|---|
| Assignees: | mkhamoyan |
| Labels: | `area-System.Globalization`, `os-ios`, `os-tvos`, `os-maccatalyst` |
| Milestone: | - |
Letters that have lowercase/uppercase
There's no such thing as "case" (in the English/Latin sense; I guess they might still be designated that way in Unicode, but I somehow doubt it) for kana. Small characters are normally used to modify the sounds of "normal sized" characters. For instance:
きよ (ki yo) -> きょ (kyo)
You're not supposed to have small characters on their own - formally that doesn't make any sense.
Letters without lowercase/uppercase
I'm not sure why you're stating the ordering of hiragana/katakana flipped here, unless it's something specific to the original test data? Looking at the Unicode blocks, though, there's a complete match, so it should be possible to do this via an offset from the start of the block (assuming an interleaved/phonetic ordering, rather than just two separate blocks).
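As a rough illustration of the offset idea, here is a minimal Python sketch. The helper names are hypothetical, and it assumes only the letter ranges U+3041..U+3096 (hiragana) and U+30A1..U+30F6 (katakana) correspond one to one; anything outside those ranges is passed through unchanged.

```python
# Distance between the start of the Hiragana block (U+3040)
# and the start of the Katakana block (U+30A0).
KANA_OFFSET = 0x30A0 - 0x3040  # 0x60


def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana letters U+3041..U+3096 to their katakana counterparts."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters U+30A1..U+30F6 to their hiragana counterparts."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )
```

With this mapping, U+3060 (だ) and U+30C0 (ダ) convert into each other, while katakana letters without a hiragana counterpart, such as U+30F7, are left as they are.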
Note that the two characters chosen as an example have the ゛ (ten-ten) marks, which change the sound of the characters, as part of the character, but there are also separate and combining character versions of the mark. That is, there's both だ (precomposed, U+3060) and だ (た followed by the combining mark U+3099), which are separate Unicode sequences!
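The difference between the two sequences can be checked with Python's standard `unicodedata` module. This is only a sketch for illustration; the helper name `same_after_nfc` is hypothetical and says nothing about how ICU or the Apple APIs treat the two forms.

```python
import unicodedata

PRECOMPOSED = "\u3060"        # だ as a single code point
DECOMPOSED = "\u305f\u3099"   # た followed by the combining voiced sound mark


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw sequences differ in length and content...
raw_equal = PRECOMPOSED == DECOMPOSED
# ...but they are canonically equivalent.
canonically_equal = same_after_nfc(PRECOMPOSED, DECOMPOSED)
```

Any test data for the ordering investigation probably needs to state which of the two forms it uses, since an ordinal comparison sees them as different strings.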
Letters only existing in katakana
The Unicode block actually shows an equivalent entry for a hiragana character, ゔ. There are some additional entries or other differences between the two blocks (the ten-ten marks are part of the hiragana block, for example).
I don't actually recognize everything in the blocks - they aren't in the commonly taught syllabary (at minimum when learning Japanese as a foreign language, but I don't think in Japanese schools either). I'm not sure of some of the uses of some of the characters.
Also, as an additional wrinkle, for historical reasons there's a half-width katakana block. Note this block does not include the pre-combined versions of characters, and has an additional set of (half-width) combining characters.
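To make the half-width case concrete, the following Python sketch uses compatibility normalization (NFKC) from the standard `unicodedata` module to fold a half-width sequence into its full-width precomposed form. It is an illustration only; whether the platform comparison functions apply any such folding is exactly what is unknown here.

```python
import unicodedata

# Half-width katakana TA (U+FF80) followed by the half-width
# voiced sound mark (U+FF9E).
HALFWIDTH_DA = "\uff80\uff9e"
FULLWIDTH_DA = "\u30c0"  # ダ, precomposed full-width katakana


def fold_width(text: str) -> str:
    """Apply NFKC, which maps half-width kana to full-width and composes marks."""
    return unicodedata.normalize("NFKC", text)


folded = fold_width(HALFWIDTH_DA)
# Canonical normalization alone (NFC) leaves the half-width sequence untouched.
nfc_only = unicodedata.normalize("NFC", HALFWIDTH_DA)
```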
Thanks for the explanation. My bad for using the words lowercase/uppercase; I changed them to small letter.
I wanted to give examples where the ICU `compareString` and Apple native `compareString` functions return different results when comparing hiragana and katakana letters.
We had test cases that expect, for example, \u30C0 ダ to be before \u3060 だ while using ICU, but the Apple native `compareString` function returns a different ordering result.
It is known that on Windows's NLS hiragana characters sort after katakana; on ICU it is the opposite.
This task is created to investigate how hiragana and katakana characters are sorted on Apple platforms.
@mkhamoyan @steveisok could you please triage this issue? Thanks!
While working on https://github.com/dotnet/runtime/pull/85965 we found out that the ordering of hiragana/katakana letters on native Apple platforms is not so clear. There are 3 cases:
1. hiragana small letter < katakana small letter < hiragana letter < katakana letter
2. katakana letter < hiragana letter, but it is not clear whether this comes after the small katakana letters or somewhere else.
3. Letters only existing in katakana: not sure where these letters fall in the ordering.
Find out the exact flow of ordering, update `hybrid-globalization.md` for the OSX platform `CompareString` function details, and add more test cases showing the ordering.
Contributes to https://github.com/dotnet/runtime/issues/80689
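When assembling those test cases, it may help to group sample characters by the categories used in the cases above. The Python sketch below is one hypothetical way to do that, relying on the official Unicode character names; the function name and the category labels are assumptions made for illustration, and the sketch does not assert any particular ordering on Apple platforms.

```python
import unicodedata


def kana_category(ch: str) -> str:
    """Classify a character as small/normal hiragana or katakana by its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA LETTER"):
        script = "hiragana"
    elif name.startswith("KATAKANA LETTER"):
        script = "katakana"
    else:
        # Half-width katakana, marks, and non-kana characters land here.
        return "other"
    size = "small" if " SMALL " in name else "normal"
    return f"{script} {size}"


# Sample characters: small and normal "yo" in both scripts.
samples = ["\u3087", "\u30e7", "\u3088", "\u30e8"]  # ょ ョ よ ヨ
categories = [kana_category(ch) for ch in samples]
```

Each group can then be fed to the platform comparison under test to record where it actually sorts.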