PSeitz / wana_kana_rust

Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
MIT License
70 stars 13 forks

Extend is_kanji to recognise kanji in the CJK Unified Ideographs Extension blocks (or provide alternate function) #15

Open Heliozoa opened 8 months ago

Heliozoa commented 8 months ago

Currently, is_kanji uses the Unicode range U+4E00-U+9FAF to recognise kanji, corresponding to the CJK Unified Ideographs block. Unicode has additional "extension blocks" that contain less common kanji, such as CJK Unified Ideographs Extension B, which contains the kanji 𬵪.

Since these characters are quite obscure, and it may be difficult to determine which of them qualify as "kanji", I think it would be useful to include such functionality in a crate.
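For illustration, here is a minimal sketch of what an extended check might look like. It is not the crate's code: `is_kanji_basic` only mirrors the U+4E00-U+9FAF range quoted above, and `is_kanji_extended` is a hypothetical name. The extended version assumes that every code point in the main block and in Extensions A through F should count as kanji, which is exactly the judgement call mentioned above; later extension blocks and the compatibility ideographs are left out.

```rust
/// Mirrors the range quoted in this issue (U+4E00..=U+9FAF).
/// Hypothetical helper, not the crate's actual implementation.
fn is_kanji_basic(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FAF}')
}

/// Hypothetical extended check: main block plus Extensions A-F.
/// Assumes every code point in these blocks qualifies as "kanji".
fn is_kanji_extended(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'         // CJK Unified Ideographs
            | '\u{3400}'..='\u{4DBF}'   // Extension A
            | '\u{20000}'..='\u{2A6DF}' // Extension B
            | '\u{2A700}'..='\u{2B73F}' // Extension C
            | '\u{2B740}'..='\u{2B81F}' // Extension D
            | '\u{2B820}'..='\u{2CEAF}' // Extension E
            | '\u{2CEB0}'..='\u{2EBEF}' // Extension F
    )
}

fn main() {
    // U+20B9F (𠮟) lies in Extension B, outside the basic range.
    let ext_b = '\u{20B9F}';
    assert!(!is_kanji_basic(ext_b));
    assert!(is_kanji_extended(ext_b));

    // A common kanji passes both checks; hiragana passes neither.
    assert!(is_kanji_basic('漢') && is_kanji_extended('漢'));
    assert!(!is_kanji_basic('か') && !is_kanji_extended('か'));
}
```

Keeping the extended check as a separate function would leave the existing is_kanji behaviour unchanged for current users.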

PSeitz commented 8 months ago

Are you suggesting there should be a separate method for this? This crate is mostly for hiragana and katakana, is_kanji is just for convenience.

Heliozoa commented 8 months ago

That's understandable; covering just the CJK Unified Ideographs block is enough for most purposes, I imagine.

For added context, I was already using the crate for its other functionality, and started using is_kanji to pick out kanji from the words in the JMdict dictionary. Some of those words contain kanji from the extension blocks, so they were unexpectedly (to me) getting filtered out by is_kanji.
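A rough sketch of the kind of filtering described here, showing how an extension-block kanji gets dropped. The names `kanji_in`, `in_basic_range`, and `in_basic_or_ext_b` are hypothetical stand-ins written for this example; the first predicate only imitates the U+4E00-U+9FAF range mentioned earlier rather than calling the crate, and the sample word is chosen for illustration.

```rust
/// Collects the characters of `word` accepted by the given predicate.
/// Hypothetical helper for this example only.
fn kanji_in(word: &str, is_kanji: impl Fn(char) -> bool) -> Vec<char> {
    word.chars().filter(|&c| is_kanji(c)).collect()
}

/// Stand-in for the range quoted in this issue (U+4E00..=U+9FAF).
fn in_basic_range(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FAF}')
}

/// Same range plus CJK Unified Ideographs Extension B (U+20000..=U+2A6DF).
fn in_basic_or_ext_b(c: char) -> bool {
    in_basic_range(c) || matches!(c, '\u{20000}'..='\u{2A6DF}')
}

fn main() {
    // The first character is U+20B9F (𠮟), which lies in Extension B.
    let word = "\u{20B9F}る";

    // The basic range finds no kanji in this word at all.
    assert!(kanji_in(word, in_basic_range).is_empty());

    // Including Extension B recovers the kanji.
    assert_eq!(kanji_in(word, in_basic_or_ext_b), vec!['\u{20B9F}']);

    // Words made of common kanji behave the same under both predicates.
    assert_eq!(kanji_in("漢字かな", in_basic_range), vec!['漢', '字']);
    assert_eq!(kanji_in("漢字かな", in_basic_or_ext_b), vec!['漢', '字']);
}
```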

You can close the issue if this is out of scope, or leave it up if this is something that may have a place in the crate in the future. Thanks for the quick response!