ddddxxx / LyricsX

🎶 Ultimate lyrics app for macOS.
Mozilla Public License 2.0
4.78k stars 392 forks

Redundant Furigana on kanas for Japanese lyrics #183

Open blueset opened 6 years ago

blueset commented 6 years ago

Terminologies

Background

When transliteration is taken from macOS's built-in tokenizer, the tokens it provides are words or word segments chosen by context. When these are applied directly as Japanese furigana, the kana also end up with furigana, which is redundant: annotating kana with themselves is meaningless.

Expected behavior

Kana should be stripped from the furigana text, and the remaining readings should be annotated on the corresponding kanji.

image Screenshot from LyricsX 1.4.1 (1846)

image Screenshot of the same song in a Karaoke video

Note

In some cases, the tokenizer may return a single token for a compound word, such as (繰り返し, くりかえし). In this case, the expected furigana notation is, as shown in the screenshot above, (繰, く), り, (返, かえ), し.

Some more examples:

ddddxxx commented 6 years ago

I don't understand Japanese at all. Can you elaborate more?

  1. Is the kanji-kana relation context-sensitive? Is it safe to extract all the kanji sequences and transform them one by one?

  2. Is it possible that a compound word leads to an ambiguous result? Like this: (判じ絵, はじんじえ) -> (判, はじん), じ, (絵, え) -> (判, は), じ, (絵, んじえ)

blueset commented 6 years ago
  1. Yes, it is context-sensitive. Kanji cannot be extracted before conversion; doing so would produce a highly inaccurate conversion.
  2. A compound word could lead to an ambiguous result, although this is extremely rare. If it does happen, you may want to take any one of the possible results, as nothing more can be improved without introducing a dedicated NLP library.
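
To make the ambiguity in point 2 concrete, here is a minimal Python sketch that enumerates every way a reading can be split across the kanji runs of a token. The names `alignments`, `Segment` and `Pair` are hypothetical and not part of LyricsX. The sketch assumes the token has already been divided into kana and non-kana runs, and that the kana are already hiragana.

```python
from typing import Iterator, List, Optional, Tuple

Segment = Tuple[str, bool]        # (text, is_kana)
Pair = Tuple[str, Optional[str]]  # (text, furigana or None for kana)


def alignments(segments: List[Segment], reading: str) -> Iterator[List[Pair]]:
    """Yield every assignment of `reading` to the non-kana runs in `segments`."""
    if not segments:
        # Valid only if the whole reading has been consumed.
        if reading == "":
            yield []
        return
    (text, kana), rest = segments[0], segments[1:]
    if kana:
        # A kana run must appear literally at this position in the reading.
        if reading.startswith(text):
            for tail in alignments(rest, reading[len(text):]):
                yield [(text, None)] + tail
    else:
        # A kanji run must take at least one character of the reading.
        for cut in range(1, len(reading) + 1):
            for tail in alignments(rest, reading[cut:]):
                yield [(text, reading[:cut])] + tail


# The contrived example from the question yields two candidates.
ambiguous = list(alignments([("判", False), ("じ", True), ("絵", False)], "はじんじえ"))

# The compound word from the issue description yields exactly one.
unique = list(
    alignments([("繰", False), ("り", True), ("返", False), ("し", True)], "くりかえし")
)
```

When more than one candidate comes back, any of them can be picked, as suggested above; a token with a single candidate is unambiguous.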

A possible way to deal with the issue could be:

if the line of lyrics is detected as Japanese text:
    tokenize the line
    for each token in the line:
        transliterate the token
        convert all katakana in the original token to hiragana
        if transliterated hiragana string != converted original string:
            build a RegExp from the converted original string: replace each unmatched (non-kana) substring with `(.+)`, and enclose the result in `^` and `$`
            match the pattern against the transliterated string to get all kanji-kana pairs
        prepare the token for rendering
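
The per-token part of the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: tokenization and transliteration are taken as given, so the function receives a (token, reading) pair; the names `kata_to_hira`, `is_kana` and `split_furigana` are hypothetical; and every non-kana character (including punctuation or Latin letters) is treated as part of a kanji run.

```python
import re
from typing import List, Optional, Tuple


def kata_to_hira(text: str) -> str:
    """Map katakana U+30A1..U+30F6 to hiragana by subtracting 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch for ch in text
    )


def is_kana(ch: str) -> bool:
    """True for characters in the Unicode Hiragana or Katakana blocks."""
    return "\u3041" <= ch <= "\u309f" or "\u30a0" <= ch <= "\u30ff"


def split_furigana(surface: str, reading: str) -> List[Tuple[str, Optional[str]]]:
    """Return (text, furigana) segments; furigana is None for kana runs."""
    reading = kata_to_hira(reading)
    if kata_to_hira(surface) == reading:
        return [(surface, None)]  # pure kana token: nothing to annotate

    # Group the surface form into alternating kana / non-kana runs.
    runs: List[Tuple[str, bool]] = []
    for ch in surface:
        kana = is_kana(ch)
        if runs and runs[-1][1] == kana:
            runs[-1] = (runs[-1][0] + ch, kana)
        else:
            runs.append((ch, kana))

    # Kana runs become literals, non-kana runs become capture groups.
    pattern = "^" + "".join(
        re.escape(kata_to_hira(text)) if kana else "(.+)" for text, kana in runs
    ) + "$"
    match = re.match(pattern, reading)
    if match is None:
        return [(surface, reading)]  # fallback: annotate the whole token

    groups = iter(match.groups())
    return [(text, None if kana else next(groups)) for text, kana in runs]


compound = split_furigana("繰り返し", "くりかえし")
```

Because `(.+)` is greedy, an ambiguous token such as the (判じ絵, はじんじえ) example resolves to the split that gives the earliest kanji run the longest reading. That is one arbitrary but deterministic way of taking "any of the possible results".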