Redundant Furigana on kanas for Japanese lyrics

blueset commented 6 years ago

Terminology

Kana: collective term for both Hiragana (ひらがな) and Katakana (カタカナ)
Kanji: Chinese characters (漢字)
Furigana: Kana in a small font size, placed alongside text (usually Kanji) to indicate its pronunciation. (A lyricist may occasionally assign a special pronunciation to a word, but automated annotation cannot reproduce such cases.)

Background

When the transliteration is taken from the Mac's built-in tokenizer, the tokens it provides are words or word segments chosen by context. Applying these tokens directly as Japanese Furigana produces redundant Furigana on Kana, since it is meaningless to annotate Kana with themselves.

Expected behavior

Kana should be stripped from the Furigana text, so that only the Kanji are annotated.

Screenshot from LyricsX 1.4.1 (1846)

Screenshot of the same song in a Karaoke video

Note

In some cases, the tokenizer returns a single token for a compound word, such as (繰り返し, くりかえし). The expected Furigana notation is then, as shown in the screenshot above, (繰, く), り, (返, かえ), し.

Some more examples:

(信じ, しんじ) -> (信, しん), じ
(見つけ, みつけ) -> (見, み), つけ
(繰り返し, くりかえし) -> (繰, く), り, (返, かえ), し
(判じ絵, はんじえ) -> (判, はん), じ, (絵, え)
(ムラサキ色, むらさきいろ) -> ムラサキ, (色, いろ)
(ハッと, はっと) -> ハッと

ddddxxx commented 6 years ago

I don't understand Japanese at all. Can you elaborate?

Is the Kanji-Kana relation context-sensitive? Is it safe to extract all the Kanji sequences and transform them one by one?
Could a compound word lead to an ambiguous result? For example: (判じ絵, はじんじえ) -> (判, はじん), じ, (絵, え) or (判, は), じ, (絵, んじえ)

blueset commented 6 years ago

Yes, it is context-sensitive. Kanji cannot be extracted before conversion; doing so would make the conversion highly inaccurate.
A compound word could lead to an ambiguous result, although this is extremely rare. If it does happen, you can take any of the possible results, since nothing more can be done without introducing a dedicated NLP library.
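To illustrate the ambiguity with the hypothetical reading from the question (はじんじえ is a made-up reading, not a real word), here is a minimal Python sketch. It assumes the regular-expression approach and only shows that greedy and lazy capture groups pick different, equally valid splits:

```python
import re

# Hypothetical token from the question: surface 判じ絵 with the made-up reading はじんじえ.
reading = "はじんじえ"

# Greedy groups give the first kanji as much of the reading as possible.
greedy = re.fullmatch(r"(.+)じ(.+)", reading).groups()

# A lazy first group gives the first kanji as little as possible.
lazy = re.fullmatch(r"(.+?)じ(.+)", reading).groups()

print(greedy)  # ('はじん', 'え')
print(lazy)    # ('は', 'んじえ')
```

Either result is a consistent pairing; which one is linguistically right cannot be decided from the strings alone.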

A possible way to deal with the issue could be:

if the line of lyrics is detected as Japanese text:
    tokenize the line
    for each token in the line:
        transliterate the token to hiragana
        convert all katakana in the original token to hiragana
        if transliterated hiragana string != converted original string:
            replace each non-kana (kanji) run in the converted original string with `(.+)`, and enclose the result in `^` and `$` (RegExp)
            match the pattern against the transliterated string; each captured group is the reading of the corresponding kanji run
        prepare the token for rendering
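The per-token part of the steps above can be sketched in Python as follows. This is only an illustration, not LyricsX code: the names `to_hiragana` and `split_furigana` are hypothetical, the (surface, reading) pair is assumed to come from the tokenizer, and every non-kana run (including punctuation or Latin letters) is treated as if it were kanji.

```python
import re
from typing import List, Optional, Tuple

# Hiragana, katakana, and the prolonged sound mark (character-class body).
KANA = "ぁ-ゖァ-ヺー"
RUNS = re.compile(f"[{KANA}]+|[^{KANA}]+")  # alternating kana / non-kana runs
KANA_RUN = re.compile(f"[{KANA}]+")


def to_hiragana(text: str) -> str:
    """Convert katakana ァ..ヶ to hiragana; leave other characters untouched."""
    # The two Unicode blocks are 0x60 code points apart (ァ U+30A1, ぁ U+3041).
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )


def split_furigana(surface: str, reading: str) -> List[Tuple[str, Optional[str]]]:
    """Split one token into (text, furigana) segments; kana runs get None."""
    reading = to_hiragana(reading)
    runs = RUNS.findall(surface)
    # Kana runs must appear literally in the reading; other runs become groups.
    pattern = "".join(
        re.escape(to_hiragana(run)) if KANA_RUN.fullmatch(run) else "(.+)"
        for run in runs
    )
    match = re.fullmatch(pattern, reading)
    if match is None:
        # Surface and reading disagree: fall back to annotating the whole token.
        return [(surface, reading)]
    groups = iter(match.groups())
    return [
        (run, None) if KANA_RUN.fullmatch(run) else (run, next(groups))
        for run in runs
    ]


print(split_furigana("繰り返し", "くりかえし"))
# [('繰', 'く'), ('り', None), ('返', 'かえ'), ('し', None)]
print(split_furigana("ムラサキ色", "むらさきいろ"))
# [('ムラサキ', None), ('色', 'いろ')]
```

Because the capture groups are greedy, an ambiguous token resolves to the split that gives the earliest kanji run the longest reading; that is an arbitrary choice among the possible results.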

ddddxxx / LyricsX