JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Similar-meaning homophone (異字同訓) distinctions #107

Open stephenmk opened 8 months ago

stephenmk commented 8 months ago

I was recently told that a user isn't satisfied with JMdict's handling of similar homophone / 異字同訓 terms. I pointed out that JMdict does have many sense notes for indicating that particular senses are especially used with particular kanji forms. The original port of JMdict data to the Yomichan web extension did not include any of these sense notes, which may have caused the perception that this information is not available. But it is true that there are some entries missing these sorts of notes.

At any rate, this got me thinking about how JMdict can be improved with regard to these words. Bret Mayer's website contains about 150 such groups of terms with many example sentences. He says it was translated from a list compiled by the Japanese government Council for Cultural Affairs. Also, Kanjipedia has published a large list of these words with definitions and examples in Japanese.

In the new edition of Sanseido's 国語辞典 ("sankoku"), entries for these types of words contain markers which point to adjacent entries of the same reading. The appendix in sankoku describes these markers as "書き分け注意." There are about 1,500 "groups" of these entries. These groups are not collections of all words using the same reading, but rather homophones that are similar in meaning. For example, 膿む/熟む and 生む/産む are four separate entries in sankoku with the same reading (うむ) comprising two separate "groups."

I compiled a list of these groups from sankoku and attempted to correlate them to JMdict sequence numbers. The list is in a Google spreadsheet that everyone in the edict-jmdict Google group can edit.

I attempted to identify which JMdict entries have these forms merged. Presumably words that are split into separate JMdict entries (such as 膿む and 熟む) don't require much attention. For forms that are merged (such as ほか for 他 and 外), we may want to go down the list and double-check that the JMdict entries adequately distinguish between the different forms.
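
For anyone who wants to reproduce the merged/split check, here is a minimal sketch of one way it could be done. It is not the script used to build the spreadsheet. The `groups` and `entries` data, the sequence numbers, and the `classify` helper are all hypothetical, and the lookup matches on kanji form only, whereas a real pass would also need to compare readings.

```python
from collections import defaultdict

# Hypothetical sample of sankoku-style groups: (reading, kanji forms).
groups = [
    ("うむ", ["膿む", "熟む"]),
    ("ほか", ["他", "外"]),
    ("かおり", ["香り", "薫り"]),
]

# Hypothetical JMdict-like index: sequence number -> kanji forms.
# These sequence numbers are invented for illustration only.
entries = {
    9000001: ["膿む"],
    9000002: ["熟む"],
    9000003: ["他", "外"],
    9000004: ["香り", "薫り"],
}


def classify(groups, entries):
    """Label each group as 'merged', 'split', or 'unmatched'."""
    # Build a reverse index: kanji form -> set of sequence numbers.
    form_to_seqs = defaultdict(set)
    for seq, forms in entries.items():
        for form in forms:
            form_to_seqs[form].add(seq)

    results = []
    for reading, forms in groups:
        seqs_per_form = [form_to_seqs.get(form, set()) for form in forms]
        if any(not seqs for seqs in seqs_per_form):
            # At least one form was not found in any entry.
            status = "unmatched"
        elif set.intersection(*seqs_per_form):
            # Every form in the group appears in one shared entry.
            status = "merged"
        else:
            status = "split"
        results.append((reading, status))
    return results


for reading, status in classify(groups, entries):
    print(reading, status)
```

Only the groups reported as "merged" would then need a manual look.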

JMdictProject commented 8 months ago

A very interesting table, and a good starting point if someone wants to check/verify/amend the handling.

I looked through a few of the ones which are merged into single JMdict entries, e.g. 香り and 薫り, but I didn't see any that I felt needed flagging. In the 香り/薫り case, all the references I checked had them as alternatives without any suggestion the meanings differed according to the kanji form.

FragozoLeonardo commented 8 months ago

I'm the person who opened that issue on Stephen's project. How can I help? I'm not proficient in Japanese yet, but I can understand Japanese-Japanese definitions in monolingual dictionaries.

stephenmk commented 8 months ago

The entries identified in column D might be in need of attention. I have highlighted these cells in yellow.

For example, we currently have 良い, 好い, and 善い merged into the same entry. Bret's ijidoukun page describes 善い as meaning "virtuous." Various Japanese dictionaries also make a note of this meaning. The current JMdict entry has all of these forms merged together, but it doesn't have a sense for "righteous" or "virtuous." I wrote "yes" in the "Needs attention?" column in the spreadsheet.

If you'd like to help, you can go through the spreadsheet and check whether the identified entries adequately explain the differences between the corresponding kanji forms. I granted permissions to the *dev@gmail.com account that you have on your GitHub profile, so you should be able to edit the spreadsheet now.
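
If it helps to pre-screen entries before reading them, something like the following rough sketch could flag merged entries where a kanji form is never mentioned in a sense restriction (`stagk`) or sense note (`s_inf`). The XML fragment is a simplified, made-up entry (invented sequence number, notes, and glosses), not real JMdict data, and `forms_without_notes` is a hypothetical helper. A flagged form only means "worth a look", since many merged forms genuinely have no difference in meaning.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment using JMdict-style element names.
SAMPLE_ENTRY = """
<entry>
  <ent_seq>9000003</ent_seq>
  <k_ele><keb>他</keb></k_ele>
  <k_ele><keb>外</keb></k_ele>
  <r_ele><reb>ほか</reb></r_ele>
  <sense>
    <s_inf>esp. 他</s_inf>
    <gloss>other</gloss>
  </sense>
  <sense>
    <gloss>outside</gloss>
  </sense>
</entry>
"""


def forms_without_notes(entry):
    """Return kanji forms not mentioned in any stagk or s_inf element."""
    forms = [keb.text for keb in entry.iter("keb")]
    # Collect the text of every sense restriction and sense note.
    notes = [el.text or "" for el in entry.iter("stagk")]
    notes += [el.text or "" for el in entry.iter("s_inf")]
    return [form for form in forms if not any(form in note for note in notes)]


entry = ET.fromstring(SAMPLE_ENTRY)
print(entry.findtext("ent_seq"), forms_without_notes(entry))
```

The substring match is deliberately crude. It would miss notes that describe a form indirectly, so it should only be used to prioritize manual review.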

FragozoLeonardo commented 8 months ago

Okay! I will check at least two or three per day. I have full-time work plus study, but I can definitely help. Thank you for granting the permissions.