dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Support Unicode Script character classes in regular expressions #57038

Open TheAstrologic opened 3 years ago

TheAstrologic commented 3 years ago

I noticed in .NET, there seems to be no support for the regex scripts for Kanji. There's \p{IsCJKUnifiedIdeographs}, but that will also find Chinese and Korean characters, and not Kanji specifically. I noticed other versions use "Han" - which includes Chinese characters and Kanji. I don't know if there is a difference between Kanji and Chinese characters, but I do believe there's simplified Chinese, and I do feel if there are distinctions to be made, then there are distinctions to be reflected within the \p scripts.

Sorry if I didn't submit this in the right spot, I got sick of trying to match the criteria of other formats and settled with this one.

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 3 years ago

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

Issue Details
Author: TheAstrologic
Assignees: -
Labels: `area-System.Text.RegularExpressions`, `untriaged`
Milestone: -

danmoseley commented 3 years ago

> I noticed other versions use "Han"

@TheAstrologic could you point to docs on another engine that uses Han, and ideally what block(s) it maps to?

E.g., I do not see it in the list for Perl, which is generally a superset of what we have. https://perldoc.perl.org/perluniprops#Properties-accessible-through-%5Cp%7B%7D-and-%5CP%7B%7D

cc @tarekgh who may know what the named Unicode range is that covers Kanji.

DaZombieKiller commented 3 years ago

@danmoseley

> E.g., I do not see it in the list for Perl, which is generally a superset of what we have. https://perldoc.perl.org/perluniprops#Properties-accessible-through-%5Cp%7B%7D-and-%5CP%7B%7D

I searched the page for {Han} and this popped up:

\p{Script_Extensions: Han} (Short: \p{Scx=Han}, \p{Han}) (94_492:
                          U+2E80..2E99, U+2E9B..2EF3,
                          U+2F00..2FD5, U+3001..3003,
                          U+3005..3011, U+3013..301F ...)

TheAstrologic commented 3 years ago

image

Joe4evr commented 3 years ago

> I don't know if there is a difference between Kanji and Chinese characters

Not an expert in this field, but as far as I know there is no difference there. I think "simplified Chinese" could be mostly the subset of characters under a certain number of strokes, but that's a bit of a guess.

huoyaoyuan commented 3 years ago

> I don't know if there is a difference between Kanji and Chinese characters

The CJKV Unified Ideographs are meant to use the same code point for the same character across different languages. They're meant to be indistinguishable, if I've understood correctly.

danmoseley commented 3 years ago

@DaZombieKiller I am not knowledgeable about this. But that apparently is a script extension, not a block. What we support today seems to be only Unicode categories, and Unicode blocks.

E.g., we support \p{IsGurmukhi}, which maps to \u0A00\u0A79 (note that the data structures in the code always record the start through one past the end).

Perl apparently supports this as \p{Block: Gurmukhi}

 \p{Block: Gurmukhi}     (NOT \p{Gurmukhi} NOR \p{Is_Gurmukhi})
                            (128: U+0A00..0A7F)

(Unicode also defines this block as 0A00-0A7F so it may be time to update our tables here.)

Perl supports \p{Gurmukhi} for the Unicode script range

  \p{Gurmukhi}            \p{Script_Extensions=Gurmukhi} (Short:
                            \p{Guru}; NOT \p{Block=Gurmukhi}) (94)
  \p{Script: Gurmukhi}    (Short: \p{Sc=Guru}) (80: U+0A01..0A03,
                            U+0A05..0A0A, U+0A0F..0A10,
                            U+0A13..0A28, U+0A2A..0A30, U+0A32..0A33
                            ...)
  \p{Script_Extensions: Gurmukhi} (Short: \p{Scx=Guru}, \p{Guru})
                            (94: U+0951..0952, U+0964..0965,
                            U+0A01..0A03, U+0A05..0A0A,
                            U+0A0F..0A10, U+0A13..0A28 ...)

Interestingly, the Unicode regex document recommends that for regular expressions "Script values are generally preferred to Block values" due to issues it enumerates. So Perl seems to be doing the right thing.
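
One reason a block can be a poor proxy for a script is that a block is just a contiguous range and may contain unassigned code points. The following is a minimal Python sketch of that point, not .NET code: it uses the Gurmukhi block range quoted above (U+0A00..U+0A7F) and treats "has no character name in Python's `unicodedata`" as a stand-in for "unassigned". The helper names `in_gurmukhi_block` and `is_assigned` are hypothetical, and the exact count depends on the Unicode version bundled with the Python build.

```python
import unicodedata

# Gurmukhi block as quoted in the thread: U+0A00..U+0A7F (128 code points).
GURMUKHI_BLOCK = range(0x0A00, 0x0A80)


def in_gurmukhi_block(ch: str) -> bool:
    """True if the code point falls inside the block's contiguous range."""
    return ord(ch) in GURMUKHI_BLOCK


def is_assigned(ch: str) -> bool:
    """Approximation: a code point with no name is treated as unassigned."""
    return unicodedata.name(ch, None) is not None


# Code points that are inside the block but carry no character name.
unnamed = [f"U+{cp:04X}" for cp in GURMUKHI_BLOCK if not is_assigned(chr(cp))]
print(f"{len(GURMUKHI_BLOCK)} code points in block, {len(unnamed)} without a name")
```

A block-based class such as \p{IsGurmukhi} covers every code point in the range, named or not, whereas a script-based class would cover only the characters Unicode assigns to that script.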

The .NET regex approach probably dates back 20 years. I wonder whether we should now support script properties and if so how we would disambiguate with blocks without changing the existing meaning of patterns. Happily we're currently prefixing with "Is". So maybe \p{Gurmukhi} could be introduced for the script and the existing \p{IsGurmukhi} remains the block.

The full Unicode definition of Han is

2E80..2E99    ; Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2EF3    ; Han # So  [89] CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5    ; Han # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
3005          ; Han # Lm       IDEOGRAPHIC ITERATION MARK
3007          ; Han # Nl       IDEOGRAPHIC NUMBER ZERO
3021..3029    ; Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
3038..303A    ; Han # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
303B          ; Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
3400..4DBF    ; Han # Lo [6592] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DBF
4E00..9FFC    ; Han # Lo [20989] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FFC
F900..FA6D    ; Han # Lo [366] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA6D
FA70..FAD9    ; Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
16FF0..16FF1  ; Han # Mc   [2] VIETNAMESE ALTERNATE READING MARK CA..VIETNAMESE ALTERNATE READING MARK NHAY
20000..2A6DD  ; Han # Lo [42718] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6DD
2A700..2B734  ; Han # Lo [4149] CJK UNIFIED IDEOGRAPH-2A700..CJK UNIFIED IDEOGRAPH-2B734
2B740..2B81D  ; Han # Lo [222] CJK UNIFIED IDEOGRAPH-2B740..CJK UNIFIED IDEOGRAPH-2B81D
2B820..2CEA1  ; Han # Lo [5762] CJK UNIFIED IDEOGRAPH-2B820..CJK UNIFIED IDEOGRAPH-2CEA1
2CEB0..2EBE0  ; Han # Lo [7473] CJK UNIFIED IDEOGRAPH-2CEB0..CJK UNIFIED IDEOGRAPH-2EBE0
2F800..2FA1D  ; Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D
30000..3134A  ; Han # Lo [4939] CJK UNIFIED IDEOGRAPH-30000..CJK UNIFIED IDEOGRAPH-3134A

Does that look like what you need?

cc @tarekgh @GrabYourPitchforks to check I'm correctly understanding this block/script distinction.

danmoseley commented 3 years ago

Also note that the script ranges above include supplementary-plane code points (above U+FFFF). I believe the engine currently operates only on char, which is a 2-byte UTF-16 code unit, so each of those code points would span two chars. To handle those we would need a hypothetical future engine that operates on UTF-8 strings/buffers. Is that right, @GrabYourPitchforks?
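
To make the size concern concrete, here is a small Python sketch (illustrative only, not how the .NET engine is implemented) that lists the UTF-16 code units for a code point. The helper name `utf16_units` is hypothetical. A BMP ideograph fits in one 16-bit unit, while a code point from the 20000..2A6DD range needs a surrogate pair.

```python
from typing import List


def utf16_units(text: str) -> List[int]:
    """Return the 16-bit UTF-16 code units that encode the given text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


for ch in ("\u4E00", "\U00020000"):
    units = " ".join(f"{u:04X}" for u in utf16_units(ch))
    print(f"U+{ord(ch):04X} -> {units}")
# prints:
# U+4E00 -> 4E00
# U+20000 -> D840 DC00
```

An engine that matches one 16-bit unit at a time sees U+20000 as two separate values, so a single character-class test cannot cover it without extra handling.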

tarekgh commented 3 years ago

The CJK Unified Ideographs doc has the ranges supported by Unicode.

What @danmoseley listed in the comment above is the complete Han script, which includes the unified ideographs but also more than that (such as the CJK Compatibility Ideographs). The details are in Section 18.1, Han.

danmoseley commented 3 years ago

I think the next step here is for someone to investigate what it would take to support Unicode scripts and make a proposal (presumably broader than just supporting Han). My guess is that it would not need extensive changes, as these just boil down to more character classes.

https://www.regular-expressions.info/unicode.html#category is helpful here.

Re the format, it seems Perl supports \p{IsScript}, \p{IsBlock}, and \p{Category}, so perhaps that's the right approach for .NET. This is helpful: https://www.regular-expressions.info/refunicode.html

GrabYourPitchforks commented 3 years ago

@danmoseley @tarekgh I'm not sure what capabilities our current Regex class has with respect to supplementary plane code points (everything U+10000 and above). However, even if support for supplementary plane code points isn't on the horizon, I'd expect it'd be not too much work to support the following ranges, since they all cleanly fit into a single char.

2E80..2E99    ; Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2EF3    ; Han # So  [89] CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5    ; Han # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
3005          ; Han # Lm       IDEOGRAPHIC ITERATION MARK
3007          ; Han # Nl       IDEOGRAPHIC NUMBER ZERO
3021..3029    ; Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
3038..303A    ; Han # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
303B          ; Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
3400..4DBF    ; Han # Lo [6592] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DBF
4E00..9FFC    ; Han # Lo [20989] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FFC
F900..FA6D    ; Han # Lo [366] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA6D
FA70..FAD9    ; Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9

ufcpp commented 3 years ago

As some have mentioned, Unicode has unified Han characters across Japanese, simplified Chinese, and traditional Chinese. "Kanji" in Japanese and "Han" in Unicode terminology are the same thing, but many characters in Unicode's Han script blocks are used in only one of Japanese, simplified Chinese, or traditional Chinese.

@TheAstrologic might be looking for a way to identify Japanese Kanji specifically. That's not an easy task, but it seems possible to make a heuristic determination based on UAX#38 (Unihan Database). However, that is not the responsibility of Regex.