Simplification of readings in JMdict

stephenmk commented 2 months ago

Previous discussion on the mailing list: https://www.edrdg.org/jmdict_edict_list/2021/msg00108.html

Until mid-2021, if the kanji field of an entry included both kanji and katakana for part of a form, e.g. アカバナ科 and 赤花科, then the reading field typically had matching kana forms, in this case アカバナか and あかばなか, with restrictions to align the kanji/reading pairs. This was done to assist with the generation of the legacy EDICT format. This is no longer a major issue and it is now considered acceptable to have a single reading (i.e. あかばなか) in such cases.
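To make the difference concrete, here is a minimal sketch of the two styles for the アカバナ科 example. The `Reading` class and `pairs` function are hypothetical names used only for illustration; they are not part of JMdict or of any tool mentioned in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Reading:
    kana: str
    # Kanji forms this reading applies to; an empty list means "all forms".
    restricted_to: list[str] = field(default_factory=list)


def pairs(kanji_forms, readings):
    """Return every valid (kanji form, reading) pair for an entry."""
    result = []
    for form in kanji_forms:
        for reading in readings:
            if not reading.restricted_to or form in reading.restricted_to:
                result.append((form, reading.kana))
    return result


forms = ["アカバナ科", "赤花科"]

# Pre-2021 style: one reading per kana variant, aligned via restrictions.
old_style = [
    Reading("アカバナか", ["アカバナ科"]),
    Reading("あかばなか", ["赤花科"]),
]

# Simplified style: a single unrestricted reading.
new_style = [Reading("あかばなか")]

print(pairs(forms, old_style))
print(pairs(forms, new_style))
```

With the old style, アカバナ科 pairs only with アカバナか. With the simplified style, both forms pair with あかばなか, so a consumer that wants the katakana variant has to derive it itself.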

Comment received today from an anonymous contributor on entry 2862443:

> This leads to very awkward or completely misleading readings, though. Using JMdict-based dictionaries for Yomichan/Yomitan gives "あかばな" as furigana for "アカバナ" in アカバナ科, which is ugly and redundant (and separates the entry from the other アカバナ科 entries which use the correct reading). Using jisho.org, アカバナ科 has "あかばなか" as the reading for "科".

The complaint is that the simplified readings policy has caused some apps to display furigana incorrectly. We discussed furigana recently and the outcome of that conversation was that it's up to the apps to implement furigana correctly.

I'm opening this issue in case anyone else has anything to add.

JMdictProject commented 2 months ago

Yes, it's really up to the apps/sites that want to display furigana to establish them properly. I appreciate that the reading simplifications in JMdict starting in 2021 make it harder to do it from the dictionary alone. There are other sources. Issue #118 discusses some of them.

parfait8566 commented 2 months ago

Even assuming apps/sites are able to establish some tool to display furigana properly, I think it'd be desirable to have the JMdict data as accurate as possible in the first place.

> This approach has led on occasion to some rather complex and ugly entries, and it's appropriate to ask whether it's really worth doing. Does it really matter? A recent example of this is the 喉が渇く entry (*), where some variants were added containing ノド in place of 喉. The reading part of that entry now contains: のどがかわく[喉が渇く,のどが渇く,喉が乾く,のどが乾く,喉がかわく]；ノドがかわく[ノドが渇く,ノドが乾く]

Could be just me, but I don't find this particularly complex or ugly. And if we're distinguishing between hiragana and katakana at all, to me it makes the most sense to separate the readings accordingly.

robinjmdict commented 2 months ago

There's really no reason to distinguish between hiragana and katakana in the readings field given that they're pronounced the same. I think our current approach is much better than what we had before, even if it makes generating furigana a little harder.

parfait8566 commented 2 months ago

If there's no reason to distinguish between hiragana and katakana in the readings field, why not just stick to one of them all the way through? As far as I can see, pretty much all other dictionaries do distinguish them even for readings. I think the benefits outweigh the possible inconveniences (which anyway would probably only apply to a minority of a minority, see how in the 喉が渇く example the ノド entries are now hidden).

stephenmk commented 2 months ago

> Using JMdict-based dictionaries for Yomichan/Yomitan gives "あかばな" as furigana for "アカバナ" in アカバナ科, which is ugly and redundant

I updated my distribution of JMdict for Yomitan to normalize the reading keys to match the corresponding kana usage within kanji keys. So アカバナ科 now appears without furigana over アカバナ. Other app devs are free to do the same.
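For illustration only, this kind of normalization could be sketched as below. It is not the code used in the distribution mentioned above, and `to_hiragana` and `normalize_reading` are hypothetical names. It assumes each kana run in the kanji key can be found, in order, in the reading. That greedy matching can misalign if the same kana also occurs earlier inside a kanji's own reading.

```python
import re

# Map katakana ァ..ヶ (U+30A1..U+30F6) to hiragana ぁ..ゖ (U+3041..U+3096).
KATA_TO_HIRA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}

# A run of hiragana/katakana characters (including the long-vowel mark ー).
KANA_RUN = re.compile(r"[\u3041-\u3096\u30A1-\u30FC]+")


def to_hiragana(text):
    """Convert katakana to hiragana; other characters are left as-is."""
    return text.translate(KATA_TO_HIRA)


def normalize_reading(form, reading):
    """Rewrite `reading` so its kana script matches the kana used in `form`."""
    result = reading
    cursor = 0
    for match in KANA_RUN.finditer(form):
        run = match.group()
        # Compare in hiragana so the script difference does not matter.
        pos = to_hiragana(result).find(to_hiragana(run), cursor)
        if pos == -1:
            return reading  # cannot align; leave the reading unchanged
        result = result[:pos] + run + result[pos + len(run):]
        cursor = pos + len(run)
    return result


print(normalize_reading("アカバナ科", "あかばなか"))
print(normalize_reading("赤花科", "あかばなか"))
print(normalize_reading("ノドが渇く", "のどがかわく"))
```

Once the reading for アカバナ科 is アカバナか, an app can see that the アカバナ portion of the key is identical to the reading and omit furigana over it.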

Simplification of readings in JMdict #137