JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Ambiguous, priority-tagged keys in example sentence index strings #121

Open stephenmk opened 5 months ago

stephenmk commented 5 months ago

Occasionally I come across example sentences that are keyed to the incorrect entry because the key has been defined ambiguously. See this report about 易々 here for example. The index string contains 易々{やすやす}~ instead of 易々(やすやす){やすやす}~ and consequently the sentence ends up in the entry for いい【易々】 in the JMdict_e_examp file.

I'd like to fix all of these errors at once, so I tried to search for priority-tagged keys (i.e., keys with the ~ symbol appended) which could be considered ambiguous. Unfortunately there seem to be at least several hundred. The precise number depends upon how we define ambiguity.

Even if we assume that the key "ど" belongs to "ど[nokanji]" and also make use of the sense number information, by my count there are still 331 ambiguous sentences. I posted a CSV file with the data here. Working through this list would be quite a challenge.

JMdictProject commented 5 months ago

Most of the ambiguous indices came about as a result of new JMdict entries being created since the original sentence indexing was done about 20 years ago. I think the 易々/やすやす/いい is such a case.

Disambiguation can be done in two ways:

Turning to the specific questions:

> If a sentence is indexed to "ど" and we have two "ど" entries, one for "ど【土】" and one for "ど[nokanji]", is that considered ambiguous?

Probably not. They both should be indexed using KANJI(ど).

> If it's indexed to "ど[05]" and there only exists one "ど" entry with five senses, is that still considered ambiguous?

No. The sense number is really only for matching by the reporting app; not for finding the sentence pair (at least for WWWJDIC - I can't comment on other apps.)

> Since "易々{やすやす}" didn't get assigned to the correct entry, it seems safe to say that the information in curly braces is not used to disambiguate.

Correct. The part in {} is only to indicate the form in which the term appears in the sentence. In WWWJDIC it's used to enable the indexed term to be highlighted in the sentence display. I also use it in a validation program to verify the integrity of the indices.
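For readers who want to experiment with the index strings quoted in this thread, here is a minimal parsing sketch. The token layout `headword(reading)[sense]{surface}~` is inferred only from the examples in this issue; the names `IndexKey`, `parse_key` and `parse_index_string` are hypothetical, and the real index format may allow variants this regex does not cover.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed layout, inferred from the examples in this thread:
#   headword(reading)[sense]{surface}~
# Every part after the headword is optional.
TOKEN_RE = re.compile(
    r"^(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<priority>~)?$"
)


@dataclass
class IndexKey:
    headword: str            # dictionary form used to look up the entry
    reading: Optional[str]   # qualifier in parentheses, if any
    sense: Optional[int]     # sense number in square brackets, if any
    surface: Optional[str]   # form as it appears in the sentence, if any
    priority: bool           # True when the key ends with "~"


def parse_key(token: str) -> IndexKey:
    """Split one whitespace-delimited key into its parts."""
    m = TOKEN_RE.match(token)
    if m is None:
        raise ValueError(f"unrecognised index key: {token!r}")
    sense = m.group("sense")
    return IndexKey(
        headword=m.group("headword"),
        reading=m.group("reading"),
        sense=int(sense) if sense is not None else None,
        surface=m.group("surface"),
        priority=m.group("priority") is not None,
    )


def parse_index_string(index_string: str) -> List[IndexKey]:
    """Parse a whole index string, one key per whitespace-separated token."""
    return [parse_key(tok) for tok in index_string.split()]
```

With this sketch, `parse_key("易々{やすやす}~")` has no `reading`, which is exactly the missing qualifier that the original report describes.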

stephenmk commented 5 months ago

OK, so if the index parser is simple and doesn't use all the information available to find the correct entry, then by my count there are 572 instances of keys within index strings that we can consider to be ambiguous.
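One way to make this broad definition concrete is sketched below: a priority-tagged key counts as ambiguous when its headword (plus its reading, if one is given) occurs in more than one entry. This is an illustration only, not the script used to produce the count. The entry IDs and forms are the ones quoted elsewhere in this thread, the `ENTRIES` table is a hand-made toy slice with other forms omitted, and a real linker may apply further rules that shrink the result.

```python
import re

# Toy slice of the dictionary: entry ID -> (kanji forms, readings).
# IDs and forms are taken from this thread; forms not mentioned here are omitted.
ENTRIES = {
    1546070: (["溶ける", "解ける"], ["とける"]),
    1198910: (["解ける"], ["とける"]),
    1253270: (["鯨"], ["くじら"]),
    2846155: (["鯨"], ["いさな"]),  # 鯨 is not the leading form in this entry
}

# Assumed key layout: headword(reading)[sense]{surface}~
KEY_RE = re.compile(
    r"^([^(\[{~]+)(?:\(([^)]*)\))?(?:\[\d+\])?(?:\{[^}]*\})?(~)?$"
)


def candidates(headword, reading=None):
    """Return the sorted IDs of all entries that contain the headword."""
    ids = []
    for entry_id, (kanji, readings) in ENTRIES.items():
        if headword not in kanji and headword not in readings:
            continue
        if reading is not None and reading not in readings:
            continue
        ids.append(entry_id)
    return sorted(ids)


def ambiguous_priority_keys(index_string):
    """List (key, candidate IDs) for priority-tagged keys with several candidates."""
    found = []
    for token in index_string.split():
        m = KEY_RE.match(token)
        if m is None or m.group(3) is None:
            continue  # unparsable, or not priority-tagged
        ids = candidates(m.group(1), m.group(2))
        if len(ids) > 1:
            found.append((token, ids))
    return found
```

Under this definition a key such as `解ける(とける)[02]{とけた}~` is still flagged even though it carries a reading, because the kanji-reading pair occurs in two entries.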

We can probably use the extra available information (sense numbers, readings within curly braces, "usually kana" info) to fix a couple hundred of these instances automatically. If I were to provide a list of sentence IDs, index strings, and fixed index strings, would we be able to run a bulk update (find-and-replace) on the index database?
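A guarded find-and-replace of the kind proposed here might look like the sketch below. It assumes the index database can be represented as a mapping from sentence ID to index string, which is purely an illustration; the function name `bulk_update` and the in-memory dictionary are hypothetical and say nothing about how the real index is stored or edited.

```python
from typing import Dict, List, Tuple

# One proposed fix: (sentence ID, current index string, corrected index string).
Fix = Tuple[int, str, str]


def bulk_update(index_db: Dict[int, str], fixes: List[Fix]) -> Tuple[List[int], List[int]]:
    """Apply each fix only if the stored string still equals the expected one.

    Returns (applied IDs, skipped IDs). A fix is skipped when the sentence is
    missing or its index string has changed since the list was prepared.
    """
    applied: List[int] = []
    skipped: List[int] = []
    for sent_id, old, new in fixes:
        if index_db.get(sent_id) == old:
            index_db[sent_id] = new
            applied.append(sent_id)
        else:
            skipped.append(sent_id)
    return applied, skipped
```

Comparing against the old string before writing means a stale list cannot overwrite an index that someone has already corrected by hand.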

By the way, out of those 572 instances, only 8 contain readings in parentheses. We'll need to replace these readings with explicit sequence numbers since the kanji-reading pairs all belong to more than one entry. (In the case of 家・うち, the pair is in entry 1457730 as a search-only form.)

| Sent_ID | Index String | Index | Entries |
|---|---|---|---|
| 74073 | 靴紐{靴ひも} が 解ける(とける)[02]{とけた}~ | 解ける(とける)[02]{とけた}~ | 1546070;1198910 |
| 148483 | 朱(しゅ)[02]~ に 交わる{交われば} 赤い[01]{赤く} 成る[01]{なる} | 朱(しゅ)[02]~ | 2273400;2856794 |
| 174571 | 固体~ が 解ける(とける)[04]~ と 液体 になる[01] | 解ける(とける)[04]~ | 1546070;1198910 |
| 187003 | 家(うち)[01]~ へ 着く(つく)[01]{ついたら} 電話 が 鳴る{鳴っていた} | 家(うち)[01]~ | 1457730;1191740 |
| 191733 | 我々{われわれ} の 全て{すべて} が 生まれつき 音楽 の 才(さい)[01]~ が[01] 有る{ある} 訳ではない{わけではない} | 才(さい)[01]~ | 1294940;2835063 |
| 213327 | 其の[01]{その} お陰で{おかげで} 誤解 が 解ける(とける)[03]~ | 解ける(とける)[03]~ | 1546070;1198910 |
| 228246 | 家(うち)[01]{うち}~ には 十 頭(とう) の 牛[01] が 居る(いる)[01]{いる} | 家(うち)[01]{うち}~ | 1457730;1191740 |
| 229051 | 一番(いちばん)[01]{いちばん} 頭(あたま)[02] の 良い 生徒 でさえ[01] 其の[01]{その} 問題 は 解ける(とける)[01]{解けなかった}~ | 解ける(とける)[01]{解けなかった}~ | 1546070;1198910 |

JMdictProject commented 3 months ago

> OK, so if the index parser is simple and doesn't use all the information available to find the correct entry, then by my count there are 572 instances of keys within index strings that we can consider to be ambiguous.

I don't think there are 572 ambiguous instances. Take the first line in your table, sentence 74073 with the key 解ける(とける)[02]{とけた}~.

Yes, 解ける is found in two entries (1546070;1198910) but it is only the leading kanji form in 1198910. In 1546070 the leading kanji form is 溶ける and sentences are linked to that form. You can verify this by looking up 解ける in WWWJDIC.
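The "leading kanji form" rule described in this comment can be sketched as follows. This is an assumption-laden illustration, not WWWJDIC's actual code: the table `KANJI_FORMS` is a toy slice containing only the two entries named above, with the kanji forms in entry order and any unmentioned forms omitted, and the function name is hypothetical.

```python
# Toy slice: entry ID -> kanji forms in entry order (first item = leading form).
# Only the forms mentioned in this thread are listed.
KANJI_FORMS = {
    1546070: ["溶ける", "解ける"],
    1198910: ["解ける"],
}


def resolve_by_leading_form(headword):
    """Return the sorted IDs of entries whose leading kanji form is the headword."""
    return sorted(
        entry_id
        for entry_id, forms in KANJI_FORMS.items()
        if forms and forms[0] == headword
    )
```

Here `resolve_by_leading_form("解ける")` returns only 1198910, even though 解ける also appears in 1546070. Under this rule a key would be genuinely ambiguous only when the function returns more than one ID.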

Finding the actual ambiguous instances is tricky. I don't think the sense numbers and the written forms in curly braces are actually much use for that.

JMdictProject commented 3 months ago

It's the same with 家(うち) and 才(さい) - there are actually no ambiguities in the sentence linking. 朱(しゅ) was ambiguous so I replaced the reading with the entry number.

stephenmk commented 1 month ago

> Yes, 解ける is found in two entries (1546070;1198910) but it is only the leading kanji form in 1198910.

This "leading kanji form" method doesn't seem to work reliably. For example, sentence #142850 has the following index string.

生きる{生きている} 鯨{クジラ}~ を 見る{見た} 事がある{ことがある}

In the latest version of the JMdict_e_examp file, this sentence is found in entry 2846155 (いさな) rather than entry 1253270 (くじら), even though 鯨 is the leading form only in the latter.

> I don't think the sense numbers and the written forms in curly braces are actually much use for that.

In the example above, the クジラ in curly braces shows that the sentence belongs to the くじら entry rather than いさな. I'm only suggesting that the info could be a useful heuristic for spotting these incorrectly indexed sentences.
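A rough sketch of that heuristic: normalise the form in curly braces from katakana to hiragana and check which candidate entry has it as a reading. The `READINGS` table is a toy slice built from the two entries named above, and the function names are hypothetical.

```python
# Toy slice: entry ID -> readings, using the two entries discussed above.
READINGS = {
    1253270: ["くじら"],
    2846155: ["いさな"],
}


def kata_to_hira(text):
    """Convert katakana to hiragana; other characters pass through unchanged."""
    # Katakana ァ..ヶ sit exactly 0x60 code points above their hiragana forms.
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def entries_matching_surface(surface, candidate_ids):
    """Keep the candidate entries that list the normalised surface form as a reading."""
    normalised = kata_to_hira(surface)
    return [
        entry_id
        for entry_id in candidate_ids
        if normalised in READINGS.get(entry_id, [])
    ]
```

This only helps when the surface form is written in kana and is uninflected. A form such as とけた would not match the reading とける, so a result of zero or several entries should be treated as "no information" rather than as a decision.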

I've been meaning to get around to doing some more work on this issue. I might have some more progress to share soon.

JMdictProject commented 1 month ago

Yes, the arrival of that 鯨/いさな entry meant that it competes with 鯨/くじら for which one gets linked. I've now amended the 鯨 links to 鯨(くじら) which should fix it.

If I get the time and energy I should check the unqualified kanji indices in the sentences for cases where there are multiple dictionary entries with the same form. Fixing them can be a problem - Tatoeba's global edit is very good, but it has a problem with cases where the index form is at the start of the sentence.