JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

JMdict and furigana #118

Closed JMdictProject closed 3 months ago

JMdictProject commented 10 months ago

This is for discussion associated with the JMdict Next Generation initiative (https://www.edrdg.org/wiki/index.php/JMdict:_Next_Generation).

One topic under discussion is https://www.edrdg.org/wiki/index.php/JMdict:_Next_Generation#Entry-wide_Inflection_Pattern_Elements. See the end of #111 for related comments.

stephenmk commented 10 months ago

Not sure if this is already on the list somewhere (I searched for "furigana" and found nothing), but it would be handy to distribute this furigana information within the main file.

https://github.com/Doublevil/JmdictFurigana

If this was integrated, we'd definitely want to have a program on the backend to produce the furigana info automatically for new terms. A good implementation would be able to correctly solve the vast majority of reading and surface form combinations. Otherwise we'd have to update the furigana information manually all the time, which would be pretty tedious.

I have written a modestly capable furigana distribution program in python, although I haven't (yet?) designed it to distribute the correct kanji readings over each individual kanji. I'd need to integrate the reading information from kanjidic to do something like that, but I think it would work well.
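For readers unfamiliar with this kind of distribution, here is a minimal sketch of the general idea, not stephenmk's actual program: kana in the surface form act as anchors, and each run of kanji receives whatever part of the reading falls between them. The `KANJI` range and the function names are assumptions made for illustration, and a whole kanji run gets one reading rather than one reading per kanji.

```python
import re

# Assumption: "kanji" here means the basic CJK ideograph block plus the iteration mark 々.
KANJI = r"\u4e00-\u9fff\u3005"


def to_hiragana(text):
    """Map katakana to hiragana one character at a time (string length is preserved)."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c for c in text
    )


def distribute(surface, reading):
    """Return a list of (text, furigana or None) chunks, or None if no alignment exists."""
    chunks = re.findall(rf"[{KANJI}]+|[^{KANJI}]+", surface)
    pattern = ""
    for chunk in chunks:
        if re.match(rf"[{KANJI}]", chunk):
            pattern += "(.+?)"  # a kanji run absorbs some part of the reading
        else:
            pattern += "(" + re.escape(to_hiragana(chunk)) + ")"  # kana must match itself
    match = re.fullmatch(pattern, to_hiragana(reading))
    if match is None:
        return None
    result = []
    for i, chunk in enumerate(chunks, start=1):
        if re.match(rf"[{KANJI}]", chunk):
            # Slice the original reading so its kana script is kept as written.
            result.append((chunk, reading[match.start(i):match.end(i)]))
        else:
            result.append((chunk, None))
    return result


print(distribute("食べ物", "たべもの"))
print(distribute("日本", "にほん"))
```

In this sketch 日本 comes back as a single chunk with the reading にほん, which is the limitation described above: splitting it into 日[に]本[ほん] needs per-kanji reading data.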

Anyway, this is all just a thought that occurred to me. I understand if it's too much work to fit in with the current plan.

JMdictProject commented 10 months ago

Re that furigana project, it was discussed a bit in mid-2021 on the mailing list: https://www.edrdg.org/jmdict_edict_list/2021/msg00075.html and following.

While I am not a fan of furigana, and I wouldn't be interested in embedding it in the JMdict database, the ability to add it on-the-fly to JMdict entries might be useful. That's why I suggested that it would help if that furigana project included the JMdict sequence numbers to help such alignment. It was raised as an issue there, but the guy running the project didn't seem interested and it fizzled out.

hlorenzi commented 10 months ago

I've also proposed privately a few years ago in April 2020 (didn't know there were discussion groups!) for the inclusion of furigana data in the JMdict database, but it was also rejected.

The algorithm I've built is able to correctly segment furigana for the most complex of JMdict's entries (for example, 外国為替及び外国貿易管理法). I've been collecting and correcting furigana data throughout the years as I came across problematic entries myself. I've got a file for individual and gikun reading data and another one for full entries which trip up the algorithm. It's also through the furigana algorithm that I've been able to detect some mistakes in JMdict, looking for strange readings in a kanji's list of words by reading.

I wouldn't say my algorithm is suitable for on-the-fly attachment, though, since it's very brute-force-y, and its almost-5-year-old code could use a total rewrite. It does work well for most terms, both old and new, and I've been checking its results manually throughout these years. I could generate a companion file for distribution with JMdict, with sequence numbers and anything else that's needed. I think something in the format of my furigana_patches.txt file would serve the purpose well, while being very compact.
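As an illustration of what a brute-force approach can look like (this is a sketch under stated assumptions, not hlorenzi's code), the following tries every known reading of each kanji and keeps the combinations that consume the whole reading. The `READINGS` table is a tiny hypothetical stand-in for real per-kanji and gikun reading data.

```python
# Hypothetical, hand-written reading table used only for this example.
READINGS = {
    "日": ["にち", "に", "にっ", "ひ"],
    "本": ["ほん", "ぽん", "もと"],
    "鎧": ["よろい"],
    "板": ["いた", "ばん"],
}


def segment(surface, reading):
    """Return every way of splitting `reading` across the characters of `surface`."""
    if not surface:
        # Success only if the reading has been fully consumed as well.
        return [[]] if not reading else []
    head, rest = surface[0], surface[1:]
    # Characters without a table entry (kana, for instance) must match themselves.
    candidates = READINGS.get(head, [head])
    results = []
    for candidate in candidates:
        if reading.startswith(candidate):
            for tail in segment(rest, reading[len(candidate):]):
                results.append([(head, candidate)] + tail)
    return results


print(segment("日本", "にっぽん"))
print(segment("日本", "にほんご"))
```

An empty result means no combination of known readings fits, which is one way an unusual or mistaken reading could be flagged for manual review.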

JMdictProject commented 10 months ago

> I've also proposed privately a few years ago in April 2020 (didn't know there were discussion groups!) for the inclusion of furigana data in the JMdict database, but it was also rejected.

Yes, that proposal came to me. Here is my response:

Re your suggestion to add additional fields to the JMdict database and distribution to support the display of accurate furigana for the kanji surface forms, I am afraid I don't think that should be done, for several reasons.

(1) it would add considerable complexity to the database structure, which is complex enough at present meeting its goals of supporting a dictionary of (modern) Japanese.

(2) such an addition would also require extensive changes and additions to the editing software. We are busy enough at present maintaining and extending the system without adding a whole extra functionality.

(3) the task of adding/editing entries would become more complex as the furigana portions would need to be added or changed as the entries themselves change. We really have our hands full looking after the job at present without adding to it.

(4) furigana-supporting information is not central to a Japanese dictionary. I see it as a teaching aid. While I agree it would be convenient for people who want to display furigana to have a single reliable source, I think this is not something that justifies a very significant change to the dictionary project.

I think a far better approach would be to have a parallel database in which the surface forms of Japanese terms in JMdict could be held and extracted by systems that want to use them. Something like:

2681180 鎧[よろい]板[いた] よろい板[いた]
1582710 日[にっ]本[ぽん] 日[に]本[ほん]

should suffice. Once it was established, it would be relatively easy to run an automated check periodically to detect changes in the dictionary. It could be developed and looked after by people with an interest in displaying furigana.
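A rough sketch of how such a parallel file could be read and checked against the dictionary follows. It assumes one entry per line (a sequence number followed by whitespace-separated annotated forms, as in the example above) and assumes the current JMdict forms are already available as a dict from sequence number to a set of (kanji form, reading) pairs. The function names, the `KANJI` range, and the sequence number 9999999 are hypothetical.

```python
import re

# Assumption: kanji are the basic CJK ideograph block plus the iteration mark 々.
KANJI = r"\u4e00-\u9fff\u3005"
BRACKET = re.compile(r"\[[^\]]*\]")
ANNOTATED_RUN = re.compile(rf"[{KANJI}]+\[([^\]]*)\]")


def surface_of(form):
    """Drop the bracketed furigana: 鎧[よろい]板[いた] -> 鎧板."""
    return BRACKET.sub("", form)


def reading_of(form):
    """Replace each annotated kanji run by its furigana: よろい板[いた] -> よろいいた."""
    return ANNOTATED_RUN.sub(r"\1", form)


def parse_line(line):
    """Parse one line into (sequence number, set of (surface, reading) pairs)."""
    seq, *forms = line.split()
    return int(seq), {(surface_of(f), reading_of(f)) for f in forms}


def stale_entries(furigana_lines, jmdict_pairs):
    """List sequence numbers whose furigana data is missing or no longer matches JMdict."""
    parsed = dict(parse_line(line) for line in furigana_lines if line.strip())
    return sorted(
        seq for seq, pairs in jmdict_pairs.items() if parsed.get(seq) != pairs
    )


lines = [
    "2681180 鎧[よろい]板[いた] よろい板[いた]",
    "1582710 日[にっ]本[ぽん] 日[に]本[ほん]",
]
jmdict = {
    2681180: {("鎧板", "よろいいた"), ("よろい板", "よろいいた")},
    1582710: {("日本", "にっぽん"), ("日本", "にほん")},
    9999999: {("新語", "しんご")},  # hypothetical entry added after the file was built
}
print(stale_entries(lines, jmdict))
```

Comparing whole sets per sequence number means an entry is reported whenever a form or reading is added, removed, or changed, which is the kind of periodic check described above.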

stephenmk commented 10 months ago

> I think a far better approach would be to have a parallel database in which the surface forms of Japanese terms in JMdict could be held and extracted by systems that want to use them.

That's basically the status quo. The JmdictFurigana repo publishes a new JSON file with all the segmented data once a month. Any new terms added to JMdict between months naturally won't be included, nor will the handful of terms that the project's program fails to segment correctly. So the only advantage to including this information in the JMdict file (besides the convenience) is to handle that relatively small group of terms. I think it would be nice, but it's not a huge loss if nobody else is on board with the idea.

hlorenzi commented 5 months ago

I'll begin to publish my furigana segmentation results at https://jisho.hlorenzi.com/furigana.txt. Contributions and fixes are of course welcome at the source repository!

This file is about 25 MB uncompressed, and contains data for both JMdict and JMnedict entries. To keep the size as small as possible, the file doesn't contain JMdict sequence IDs, since you should be able to match headwords with the segmented data pretty easily.
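One way to do that matching is sketched below. The actual layout of furigana.txt is not described in this thread, so the sketch assumes each record is an annotated form in the bracket notation used earlier in the discussion; the function names and the `KANJI` range are likewise assumptions.

```python
import re

# Assumption: kanji are the basic CJK ideograph block plus the iteration mark 々.
KANJI = r"\u4e00-\u9fff\u3005"
BRACKET = re.compile(r"\[[^\]]*\]")
ANNOTATED_RUN = re.compile(rf"[{KANJI}]+\[([^\]]*)\]")


def build_index(annotated_forms):
    """Index annotated forms by the (surface, reading) pair they encode."""
    index = {}
    for form in annotated_forms:
        surface = BRACKET.sub("", form)  # strip the bracketed furigana
        reading = ANNOTATED_RUN.sub(r"\1", form)  # keep furigana, drop the kanji
        index[(surface, reading)] = form
    return index


def furigana_for(index, kanji_form, reading):
    """Look up the segmentation for one JMdict kanji/reading pair, or None."""
    return index.get((kanji_form, reading))


index = build_index(["日[にっ]本[ぽん]", "日[に]本[ほん]", "よろい板[いた]"])
print(furigana_for(index, "日本", "にほん"))
```

The key includes the reading because the surface form alone is ambiguous: 日本 appears with both にっぽん and にほん.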

The website automatically updates its entire database with fresh JMdict data a couple of times every week, and this file will be refreshed at the same time.

After 4 years of working on the algorithm and testing its coverage, I feel like most (>99%) JMdict entries have correct furigana segmentations. JMnedict entries still require a lot of work, though, due to the sheer number of entries and irregular readings. I can already see clearly wrong segmentations in the first several lines of the file.

JMdictProject commented 5 months ago

I set up this issue as a place for NG discussion. So far it's all been about furigana, so I've changed the label. As I wrote back in January, I don't think furigana details should be included in JMdict.

JMdictProject commented 3 months ago

> To keep the size as small as possible, the file doesn't contain JMdict sequence IDs,

A great pity. I think it makes the data rather more difficult to use.

The issue can be closed now.

hlorenzi commented 2 months ago

> > To keep the size as small as possible, the file doesn't contain JMdict sequence IDs,
>
> A great pity. I think it makes the data rather more difficult to use.

I've implemented this now, so every entry is accompanied by the corresponding sequence ID. Hopefully this makes it more useful!

I just realized that the lack of sequence IDs had also been mentioned at the start of this thread, regarding the other furigana repo. It wasn't my intention to dismiss the need for such a feature!

JMdictProject commented 2 months ago

There's also the one at https://github.com/Doublevil/JmdictFurigana. AFAIK it doesn't include the sequence numbers.