Disambiguating cross references

stephenmk commented 2 years ago

I've been doing some analysis on cross references in the JMdict file. Quite a few of them are ambiguous to my program but would not be if they simply included the kanji form of the word. (They're often ambiguous from the point of view of a program searching the file, even though they may be quite unambiguous to a human reader.)

For example, Seq #2029190 contains a reference to sense #3 of あく (no kanji specified), which could mean 灰汁, 開く, or 飽く (but presumably not 悪, which has only two senses).

Would it be helpful for me to submit edits containing these kanji forms? I want to make sure before I begin sending a flood of submissions (not sure how many yet, but potentially dozens or over a hundred). I understand that these references will be supplemented with sequence numbers in the near future, so I'm not sure if these submissions are wanted or necessary. I found at least one entry which used to contain unambiguous cross references, but then the kanji forms were removed (Seq #1009290 何れ・どれ).

JMdictProject commented 2 years ago

What you describe is a problem with the current JMdict file structure, which allows these sorts of ambiguities to arise. The underlying database does not have this problem. We are in the process of moving to a revised XML structure that will make cross-references quite unambiguous. See https://www.edrdg.org/wiki/index.php/JMdict:_Next_Generation#Cross-References for details of that.

While changing the cross-reference in 2029190 to point at the kanji form of 1201960 might help in the short term, it would be a bit misleading as the kanji form is rarely used. It might be better to hold off until the new structure comes into being, hopefully a bit later this year.

stephenmk commented 2 years ago

Sounds good! Thank you for the quick and detailed reply.

JMdictProject / JMdictIssues #61