Ambiguous, priority-tagged keys in example sentence index strings

stephenmk commented 5 months ago

Occasionally I come across example sentences that are keyed to the incorrect entry because the key has been defined ambiguously. See this report about 易々 here for example. The index string contains 易々{やすやす}~ instead of 易々(やすやす){やすやす}~ and consequently the sentence ends up in the entry for いい【易々】 in the JMdict_e_examp file.

I'd like to fix all of these errors at once, so I tried to search for priority-tagged keys (i.e., keys with the ~ symbol appended) which could be considered ambiguous. Unfortunately there seem to be at least several hundred. The precise number depends upon how we define ambiguity.

- If a sentence is indexed to "ど" and we have two "ど" entries, one for "ど【土】" and one for "ど[nokanji]", is that considered ambiguous?
- If it's indexed to "ど[05]" and there only exists one "ど" entry with five senses, is that still considered ambiguous?
- Since "易々{やすやす}" didn't get assigned to the correct entry, it seems safe to say that the information in curly braces is not used to disambiguate.

Even if we assume that the key "ど" belongs to "ど[nokanji]" and also make use of the sense number information, by my count there are still 331 ambiguous sentences. I posted a CSV file with the data here. Working through this list would be quite a challenge.

JMdictProject commented 5 months ago

Most of the ambiguous indices came about because new JMdict entries have been created since the original sentence indexing was done about 20 years ago. I think 易々/やすやす/いい is such a case.

Disambiguation can be done in two ways:

- adding a reading after the kanji form. In the 易々 case, I have done that by making the index "易々(やすやす)". This is the preferred method.
- adding the JMdict sequence number, e.g. "で(#2028980)". This is the only approach that works for kana indices.
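The two disambiguation methods can be illustrated with a small parser for a single index key. This is a minimal sketch, not the parser used by WWWJDIC or the JMdict build: the token layout (headword, optional `(reading)` or `(#sequence)`, optional `[sense]`, optional `{surface form}`, optional trailing `~`) is inferred from the examples in this thread, and the names `IndexToken` and `parse_token` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout, inferred from the examples in this thread:
# headword (reading | #sequence)? [sense]? {surface form}? ~?
TOKEN_RE = re.compile(
    r"^(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<qualifier>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<tilde>~)?$"
)


@dataclass
class IndexToken:
    headword: str
    reading: Optional[str]   # e.g. やすやす in 易々(やすやす)
    seq: Optional[int]       # e.g. 2028980 in で(#2028980)
    sense: Optional[int]     # e.g. 2 in [02]
    surface: Optional[str]   # form as it appears in the sentence
    priority: bool           # True when the key ends with ~

    @property
    def is_qualified(self) -> bool:
        """True when a reading or a sequence number pins down the entry."""
        return self.reading is not None or self.seq is not None


def parse_token(token: str) -> IndexToken:
    """Split one index key into its parts; raise ValueError if it does not fit."""
    m = TOKEN_RE.match(token)
    if m is None:
        raise ValueError(f"unrecognised index key: {token!r}")
    qualifier = m.group("qualifier")
    reading = seq = None
    if qualifier is not None:
        if qualifier.startswith("#"):
            seq = int(qualifier[1:])   # sequence-number qualifier
        else:
            reading = qualifier        # reading qualifier
    sense = m.group("sense")
    return IndexToken(
        headword=m.group("headword"),
        reading=reading,
        seq=seq,
        sense=int(sense) if sense is not None else None,
        surface=m.group("surface"),
        priority=m.group("tilde") is not None,
    )
```

Under these assumptions, `parse_token("易々{やすやす}~").is_qualified` is false, while the corrected key `易々(やすやす){やすやす}~` and the kana key `で(#2028980)` both count as qualified.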

Turning to the specific questions:

> If a sentence is indexed to "ど" and we have two "ど" entries, one for "ど【土】" and one for "ど[nokanji]", is that considered ambiguous?

Probably not. They both should be indexed using KANJI(ど).

> If it's indexed to "ど[05]" and there only exists one "ど" entry with five senses, is that still considered ambiguous?

No. The sense number is really only for matching by the reporting app, not for finding the sentence pair (at least for WWWJDIC; I can't comment on other apps).

> Since "易々{やすやす}" didn't get assigned to the correct entry, it seems safe to say that the information in curly braces is not used to disambiguate.

Correct. The part in {} is only to indicate the form in which the term appears in the sentence. In WWWJDIC it's used to enable the indexed term to be highlighted in the sentence display. I also use it in a validation program to verify the integrity of the indices.
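The validation program itself is not shown in this thread, so the following is only a guess at the kind of integrity check the {} part makes possible: every key's surface form (the text in curly braces, or the headword when there are none) should occur in the sentence, in order. The function names are hypothetical, and the sample sentence is reconstructed from the surface forms of an index string rather than copied from the corpus.

```python
import re

SURFACE_RE = re.compile(r"\{([^}]*)\}")
HEADWORD_RE = re.compile(r"^[^(\[{~]+")


def surface_form(token: str) -> str:
    """Return the form expected in the sentence for one index key."""
    m = SURFACE_RE.search(token)
    if m:
        return m.group(1)              # explicit form in curly braces
    m = HEADWORD_RE.match(token)
    return m.group(0) if m else token  # otherwise fall back to the headword


def missing_forms(sentence: str, index: str) -> list[str]:
    """List the index keys whose surface form is not found, scanning left to right."""
    pos = 0
    missing = []
    for token in index.split():
        form = surface_form(token)
        found = sentence.find(form, pos)
        if found == -1:
            missing.append(token)
        else:
            pos = found + len(form)    # later keys must appear after this one
    return missing


# Hypothetical sentence, assembled from the surface forms in the index string.
SENTENCE = "生きているクジラを見たことがある。"
INDEX = "生きる{生きている} 鯨{クジラ}~ を 見る{見た} 事がある{ことがある}"
```

With these inputs `missing_forms(SENTENCE, INDEX)` returns an empty list. Dropping the `{クジラ}` part would make the check report the 鯨 key, because the kanji form does not occur in the sentence text.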

stephenmk commented 5 months ago

OK, so if the index parser is simple and doesn't use all the information available to find the correct entry, then by my count there are 572 instances of keys within index strings that we can consider to be ambiguous.

We can probably use the extra available information (sense numbers, readings within curly braces, "usually kana" info) to fix a couple hundred of these instances automatically. If I were to provide a list of sentence IDs, index strings, and fixed index strings, would we be able to run a bulk update (find-and-replace) on the index database?
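A bulk update of this kind could be applied defensively, changing a sentence only when its current index string still matches the "before" value in the list. The sketch below assumes the index data can be loaded as a plain mapping from sentence ID to index string; how the real index database is stored is not stated here, and `apply_fixes` is a hypothetical name. The replacement string in the usage note is illustrative only.

```python
def apply_fixes(indices, fixes):
    """Apply (sentence_id, old_index, new_index) rows to a copy of `indices`.

    A row is applied only if the stored index string equals `old_index`
    exactly; otherwise the sentence ID is reported as skipped so that it
    can be reviewed by hand.
    """
    updated = dict(indices)
    skipped = []
    for sent_id, old_index, new_index in fixes:
        if updated.get(sent_id) == old_index:
            updated[sent_id] = new_index
        else:
            skipped.append(sent_id)   # index changed upstream, or unknown ID
    return updated, skipped
```

For example, a row such as `(142850, "生きる{生きている} 鯨{クジラ}~ を 見る{見た} 事がある{ことがある}", "生きる{生きている} 鯨(くじら){クジラ}~ を 見る{見た} 事がある{ことがある}")` would be applied only while sentence 142850 still carries the unqualified 鯨 key.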

By the way, out of those 572 instances, only 8 contain readings in parentheses. We'll need to replace these readings with explicit sequence numbers since the kanji-reading pairs all belong to more than one entry. (In the case of 家・うち, the pair is in entry 1457730 as a search-only form.)

Sent_ID	Index String	Index	Entries
74073	靴紐{靴ひも} が解ける(とける)[02]{とけた}~	解ける(とける)[02]{とけた}~	1546070;1198910
148483	朱(しゅ)[02]~ に交わる{交われば} 赤い[01]{赤く} 成る[01]{なる}	朱(しゅ)[02]~	2273400;2856794
174571	固体~ が解ける(とける)[04]~ と液体になる[01]	解ける(とける)[04]~	1546070;1198910
187003	家(うち)[01]~ へ着く(つく)[01]{ついたら} 電話が鳴る{鳴っていた}	家(うち)[01]~	1457730;1191740
191733	我々{われわれ} の全て{すべて} が生まれつき音楽の才(さい)[01]~ が[01] 有る{ある} 訳ではない{わけではない}	才(さい)[01]~	1294940;2835063
213327	其の[01]{その} お陰で{おかげで} 誤解が解ける(とける)[03]~	解ける(とける)[03]~	1546070;1198910
228246	家(うち)[01]{うち}~ には十頭(とう) の牛[01] が居る(いる)[01]{いる}	家(うち)[01]{うち}~	1457730;1191740
229051	一番(いちばん)[01]{いちばん} 頭(あたま)[02] の良い生徒でさえ[01] 其の[01]{その} 問題は解ける(とける)[01]{解けなかった}~	解ける(とける)[01]{解けなかった}~	1546070;1198910

JMdictProject commented 3 months ago

> OK, so if the index parser is simple and doesn't use all the information available to find the correct entry, then by my count there are 572 instances of keys within index strings that we can consider to be ambiguous.

I don't think there are 572 ambiguous instances. If we look at the first line in your table:

74073 靴紐{靴ひも} が解ける(とける)[02]{とけた}~ ... 1546070;1198910

Yes, 解ける is found in two entries (1546070;1198910) but it is only the leading kanji form in 1198910. In 1546070 the leading kanji form is 溶ける and sentences are linked to that form. You can verify this by looking up 解ける in WWWJDIC.
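The difference between the two counts can be sketched as follows. This illustrates the "leading kanji form" rule as described in this comment and is not WWWJDIC's actual code. The entry data is a hand-written stand-in: only the forms named in this thread are listed, and any other kanji forms of entry 1546070 are omitted.

```python
# Kanji forms per entry, leading form first (stand-in data; other forms omitted).
ENTRIES = {
    1546070: ["溶ける", "解ける"],
    1198910: ["解ける"],
}


def entries_with_form(headword, entries):
    """Every entry that lists the headword as any of its kanji forms."""
    return sorted(seq for seq, forms in entries.items() if headword in forms)


def entries_with_leading_form(headword, entries):
    """Only the entries whose first (leading) kanji form is the headword."""
    return sorted(seq for seq, forms in entries.items() if forms and forms[0] == headword)
```

Matching on any form gives both 1198910 and 1546070 for 解ける, which is how the key ends up counted as ambiguous. Matching on the leading form alone leaves only 1198910.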

Finding the actual ambiguous instances is tricky. I don't think the sense numbers and the written forms in curly braces are actually much use for that.

JMdictProject commented 3 months ago

It's the same with 家(うち) and 才(さい) - there are actually no ambiguities in the sentence linking. 朱(しゅ) was ambiguous so I replaced the reading with the entry number.

stephenmk commented 1 month ago

> Yes, 解ける is found in two entries (1546070;1198910) but it is only the leading kanji form in 1198910.

This "leading kanji form" method doesn't seem to work reliably. For example, sentence #142850 has the following index string:

生きる{生きている} 鯨{クジラ}~ を 見る{見た} 事がある{ことがある}

In the latest version of the JMdict_e_examp file, this sentence is found in entry 2846155 (いさな) rather than entry 1253270 (くじら), even though 鯨 is the leading form only in the latter.

> I don't think the sense numbers and the written forms in curly braces are actually much use for that.

In the example above, the クジラ in curly braces shows that the sentence belongs to the くじら entry rather than いさな. I'm only suggesting that the info could be a useful heuristic for spotting these incorrectly indexed sentences.
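As a rough sketch of that heuristic: when the form in curly braces is written in kana, it can be normalised to hiragana and compared with the readings of each candidate entry. The readings below are the ones named in this thread; the function names are hypothetical, and the check says nothing when the surface form contains kanji.

```python
# Candidate entries for the key 鯨, with the readings mentioned in this thread.
CANDIDATES = {
    2846155: ["いさな"],
    1253270: ["くじら"],
}


def kata_to_hira(text: str) -> str:
    """Map katakana (ァ..ヶ) to the corresponding hiragana; leave other characters alone."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def entries_matching_surface(surface: str, candidates) -> list[int]:
    """Return the candidate entries that have the kana surface form as a reading."""
    reading = kata_to_hira(surface)
    return sorted(seq for seq, readings in candidates.items() if reading in readings)
```

Here `entries_matching_surface("クジラ", CANDIDATES)` singles out entry 1253270. A sentence currently linked to a different entry than the one this check suggests is a candidate for manual review, not proof of an error.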

I've been meaning to get around to doing some more work on this issue. I might have some more progress to share soon.

JMdictProject commented 1 month ago

Yes, the arrival of that 鯨/いさな entry meant that it competes with 鯨/くじら for which one gets linked. I've now amended the 鯨 links to 鯨(くじら) which should fix it.

If I get the time and energy I should check the unqualified kanji indices in the sentences for cases where there are multiple dictionary entries with the same form. Fixing them can be a problem - Tatoeba's global edit is very good, but it has a problem with cases where the index form is at the start of the sentence.
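Such a check might look roughly like the sketch below. It assumes a lookup table from kanji form to the sequence numbers of all entries containing that form, which would have to be built from JMdict beforehand; the table here holds only the 鯨 case from this thread, and the function name is hypothetical.

```python
import re

KANJI_RE = re.compile(r"[\u4e00-\u9fff\u3005]")   # CJK ideographs plus 々
HEADWORD_RE = re.compile(r"^[^(\[{~]+")

# Stand-in lookup table: kanji form -> entries that contain it.
FORM_TO_ENTRIES = {"鯨": [1253270, 2846155]}


def unqualified_shared_forms(index: str, form_to_entries) -> list[str]:
    """Return headwords that contain kanji, carry no (...) qualifier,
    and occur in more than one dictionary entry."""
    hits = []
    for token in index.split():
        if "(" in token:
            continue                   # already qualified by reading or sequence number
        m = HEADWORD_RE.match(token)
        if m is None:
            continue
        headword = m.group(0)
        if KANJI_RE.search(headword) and len(form_to_entries.get(headword, ())) > 1:
            hits.append(headword)
    return hits
```

Run over the old index string for sentence 142850, this reports 鯨. After the key is amended to 鯨(くじら), nothing is reported.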

JMdictProject / JMdictIssues

Ambiguous, priority-tagged keys in example sentence index strings #121