Ajatt-Tools / mecab_controller

🍣 Mecab wrapper to generate furigana readings.
https://tatsumoto.neocities.org/blog/join-our-community.html
GNU Affero General Public License v3.0

Hiragana conversion issue #1

Open saikotek opened 2 years ago

saikotek commented 2 years ago

Hello, I came here after using a great Anki plugin of yours called PitchAccent. While trying to convert a pitch pattern to hiragana, I noticed that it doesn't handle the long vowel mark ー properly. It turns out that converting katakana to hiragana isn't that easy, because there are two ways to write a long vowel. If we simply replaced the "ー" character based on the preceding vowel, we would get words like せんせえ (if the original data is written as センセー).
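To make the problem concrete, here is a minimal sketch of the naive approach described above. The table and the function name `naive_expand_long_vowels` are hypothetical and are not taken from mecab_controller or the add-on; the input is assumed to be already converted to hiragana with ー left in place.

```python
# Hypothetical sketch: not code from mecab_controller or the add-on.
# Hiragana grouped by the vowel they end in.
VOWEL_ROWS = {
    "あ": "あかがさざただなはばぱまやらわぁゃ",
    "い": "いきぎしじちぢにひびぴみりぃ",
    "う": "うくぐすずつづぬふぶぷむゆるぅゅ",
    "え": "えけげせぜてでねへべぺめれぇ",
    "お": "おこごそぞとどのほぼぽもよろをぉょ",
}
VOWEL_OF = {kana: vowel for vowel, row in VOWEL_ROWS.items() for kana in row}


def naive_expand_long_vowels(hiragana: str) -> str:
    """Replace each ー with the vowel of the kana that precedes it."""
    out = []
    for ch in hiragana:
        if ch == "ー" and out and out[-1] in VOWEL_OF:
            out.append(VOWEL_OF[out[-1]])
        else:
            out.append(ch)
    return "".join(out)


print(naive_expand_long_vowels("せんせー"))  # せんせえ, while the usual spelling is せんせい
print(naive_expand_long_vowels("こーこー"))  # こおこお, while the usual spelling is こうこう
```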

It would be best to reverse the conversion workflow: store the accents in hiragana originally, and then converting to katakana would be deterministic, right? For that you need the original data in hiragana, but from what I've seen the accent_dict data contains only katakana fields. Perhaps you cut out the hiragana fields?

I prefer to use hiragana in the pitch pattern so that I can simply use it instead of the vocab kana field in Anki. If it's too hard, never mind. Thanks for your hard work. よろしくお願いいたします。

tatsumoto-ren commented 2 years ago

Hello. The kana conversion module doesn't do anything to the ー character. It only converts kana characters. センセー becomes せんせー after conversion, which I think is correct.
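The behaviour described here, where only kana are converted and ー passes through untouched, can be sketched as follows. This is an illustration and not the module's actual implementation; the name `to_hiragana_sketch` is hypothetical. It relies on the fact that in Unicode each katakana from ァ to ヶ sits 0x60 code points above its hiragana counterpart, while ー (U+30FC) lies outside that range.

```python
# Hypothetical sketch: not the actual kana converter from mecab_controller.
KANA_OFFSET = ord("ア") - ord("あ")  # 0x60


def to_hiragana_sketch(text: str) -> str:
    """Shift katakana ァ..ヶ to hiragana; leave everything else (including ー) as is."""
    out = []
    for ch in text:
        if "ァ" <= ch <= "ヶ":
            out.append(chr(ord(ch) - KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


print(to_hiragana_sketch("センセー"))  # せんせー
```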

> the accent_dict data contains fields only in katakana

It was originally this way. The pitch accent data used in the add-on was contributed by javdejong back in 2012.

If what you need is to convert セー to せえ (and, I assume, other similar pairs), we could think about how to implement it, but that is not an issue with the kana converter module.

saikotek commented 2 years ago

I see. Then I believe it could be implemented by converting kanji to furigana instead of doing the to_katakana() conversion?

tatsumoto-ren commented 2 years ago

I don't think converting kanji to furigana is necessary. Having a simple dictionary that would map kana pairs would be the most obvious solution.

E.g.:

etc.
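As an illustration of the dictionary idea, a pair table might look something like the sketch below. The entries and the names `LONG_VOWEL_PAIRS` and `replace_long_vowel_pairs` are assumptions made for this example; they are not the pairs from the original comment, and which hiragana each pair should map to is exactly the open question.

```python
# Hypothetical pair table: the entries are illustrative assumptions.
LONG_VOWEL_PAIRS = {
    "せー": "せい",
    "けー": "けい",
    "こー": "こう",
    "とー": "とう",
    "そー": "そう",
    "ゆー": "ゆう",
}


def replace_long_vowel_pairs(hiragana: str, pairs=LONG_VOWEL_PAIRS) -> str:
    """Scan left to right and replace every listed kana+ー pair."""
    out = []
    i = 0
    while i < len(hiragana):
        pair = hiragana[i:i + 2]
        if pair in pairs:
            out.append(pairs[pair])
            i += 2
        else:
            out.append(hiragana[i])
            i += 1
    return "".join(out)


print(replace_long_vowel_pairs("せんせー"))  # せんせい
print(replace_long_vowel_pairs("こーこー"))  # こうこう
print(replace_long_vowel_pairs("らーめん"))  # らーめん (pair not listed, so ー is kept)
```

A plain table like this leaves unlisted pairs unchanged, which keeps the conversion conservative until the mapping rules are settled.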

saikotek commented 2 years ago

Yeah, but is "ー" always used to mark the long vowel in おう and えい, and never in おお, いい, or ええ?

tatsumoto-ren commented 2 years ago

Hard to tell. We need a set of examples to draw a conclusion on how to do the conversion.
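One way to gather such evidence is to run a candidate rule over known words and list where it fails. The sketch below assumes a rule in which ー becomes い after an え-row kana and う after an お-row kana, and repeats the vowel otherwise. The rule, the names, and the hand-picked word list are all assumptions for illustration; none of them come from accent_dict.

```python
# Hypothetical sketch: candidate rule and example words are illustrative assumptions.
VOWEL_ROWS = {
    "あ": "あかがさざただなはばぱまやらわぁゃ",
    "い": "いきぎしじちぢにひびぴみりぃ",
    "う": "うくぐすずつづぬふぶぷむゆるぅゅ",
    "え": "えけげせぜてでねへべぺめれぇ",
    "お": "おこごそぞとどのほぼぽもよろをぉょ",
}
VOWEL_OF = {kana: vowel for vowel, row in VOWEL_ROWS.items() for kana in row}
RULE = {"え": "い", "お": "う"}  # assumed convention: えー -> えい, おー -> おう


def expand_with_rule(hiragana: str) -> str:
    """Replace ー using RULE, falling back to repeating the preceding vowel."""
    out = []
    for ch in hiragana:
        if ch == "ー" and out and out[-1] in VOWEL_OF:
            vowel = VOWEL_OF[out[-1]]
            out.append(RULE.get(vowel, vowel))
        else:
            out.append(ch)
    return "".join(out)


# (form with ー, standard hiragana spelling)
EXAMPLES = [
    ("せんせー", "せんせい"),    # 先生
    ("こーこー", "こうこう"),    # 高校
    ("とーきょー", "とうきょう"),  # 東京
    ("おーきー", "おおきい"),    # 大きい
    ("おねーさん", "おねえさん"),  # お姉さん
    ("こーり", "こおり"),        # 氷
]

mismatches = [
    (src, expand_with_rule(src), want)
    for src, want in EXAMPLES
    if expand_with_rule(src) != want
]
for src, got, want in mismatches:
    print(f"{src}: got {got}, expected {want}")
```

In this small sample the rule handles the first three words and fails on the last three, which suggests that no rule based only on the preceding kana can be fully correct without a list of exceptions.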