Open blueset opened 6 years ago
I don't understand Japanese at all. Can you elaborate more?
Is the kanji-kana relation context-sensitive? Is it safe to extract all the kanji sequences and transform them one by one?
Is it possible for a compound word to lead to an ambiguous result, like this:
(判じ絵, はじんじえ)
-> (判, はじん), じ, (絵, え)
-> (判, は), じ, (絵, んじえ)
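To make the ambiguity concrete, here is a minimal sketch that treats each kanji as a regex wildcard and keeps the kana `じ` as a literal. The reading `はじんじえ` is the hypothetical one from the example above, not a real reading; the only difference between the two patterns is greedy versus lazy matching of the first group.

```python
import re

# Hypothetical reading from the example above (not a real reading of 判じ絵).
reading = "はじんじえ"

# Greedy first group: takes as much as possible before the literal じ.
greedy = re.match(r"^(.+)じ(.+)$", reading).groups()

# Lazy first group: stops at the first じ it can.
lazy = re.match(r"^(.+?)じ(.+)$", reading).groups()

print(greedy)  # ('はじん', 'え')
print(lazy)    # ('は', 'んじえ')
```

Both splits are consistent with the token, so the pattern alone cannot decide between them when the same kana appears in both the okurigana and a kanji reading.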
A possible way to deal with the issue could be:
if the line of lyrics is detected as Japanese text:
    tokenize the line
    for each token in the line:
        transliterate the token
        convert all katakana to hiragana in the original token
        if transliterated hiragana string != converted original string:
            replace all unmatched substrings in the original string with `(.+)`, and enclose it with `^` and `$` (RegExp)
            match the pattern against the transliterated string to get all kanji-kana pairs
        prepare the token for rendering.
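The per-token part of the steps above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the project's implementation: the function names (`kata_to_hira`, `is_kana`, `split_furigana`) are hypothetical, tokenization and transliteration are assumed to have already produced a `(surface, reading)` pair, and "unmatched substrings" is approximated as "runs of non-kana characters".

```python
import re

def kata_to_hira(text: str) -> str:
    """Convert katakana (U+30A1..U+30F6) to hiragana by shifting code points."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

def is_kana(ch: str) -> bool:
    """True for hiragana, katakana, and the prolonged sound mark."""
    return "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30fa" or ch == "ー"

def split_furigana(surface: str, reading: str):
    """Split one token into (text, ruby) segments; ruby is None for kana runs.

    Hypothetical helper: `surface` is the original token and `reading`
    is its kana transliteration from the tokenizer.
    """
    hira_reading = kata_to_hira(reading)
    # Nothing to annotate if the token is already the same as its reading.
    if kata_to_hira(surface) == hira_reading:
        return [(surface, None)]

    # Group consecutive characters into kana / non-kana runs.
    runs = []
    for ch in surface:
        kana = is_kana(ch)
        if runs and runs[-1][1] == kana:
            runs[-1][0] += ch
        else:
            runs.append([ch, kana])

    # Kana runs become literals, every other run becomes a `(.+)` group.
    pattern = "^" + "".join(
        re.escape(kata_to_hira(text)) if kana else "(.+)"
        for text, kana in runs
    ) + "$"
    match = re.match(pattern, hira_reading)
    if match is None:
        # Assumption: fall back to annotating the whole token.
        return [(surface, reading)]

    groups = iter(match.groups())
    return [(text, None if kana else next(groups)) for text, kana in runs]

print(split_furigana("繰り返し", "くりかえし"))
print(split_furigana("ムラサキ色", "むらさきいろ"))
```

Because `(.+)` is greedy, a reading in which the same kana occurs in both the okurigana and a kanji reading is resolved in favor of the longest first group; that is a property of this sketch and not necessarily the desired rule.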
Terminologies
Background
When transliteration is taken from Mac's built-in tokenizer, the tokens it provides are words or word segments chosen according to context. When these are applied directly as Japanese furigana, kana receive redundant furigana, since annotating kana with themselves is meaningless.
Expected behavior
Kana should be stripped from the furigana text, and the remaining readings should be annotated on the corresponding kanji.
Note
In some cases, the tokenizer could give a token for a compound word, like (繰り返し, くりかえし). In this case, the expected furigana notation should be, as shown on the screenshot above, (繰, く), り, (返, かえ), し.
Some more examples:
(信じ, しんじ) -> (信, しん), じ
(見つけ, みつけ) -> (見, み), つけ
(繰り返し, くりかえし) -> (繰, く), り, (返, かえ), し
(判じ絵, はんじえ) -> (判, はん), じ, (絵, え)
(ムラサキ色, むらさきいろ) -> ムラサキ, (色, いろ)
(ハッと, はっと) -> ハッと