Closed stephenmk closed 3 months ago
Most of the ambiguous indices came about as a result of new JMdict entries being created since the original sentence indexing was done about 20 years ago. I think 易々 (やすやす/いい) is such a case.
Disambiguation can be done in two ways:
Turning to the specific questions:
> If a sentence is indexed to "ど" and we have two "ど" entries, one for "ど【土】" and one for "ど[nokanji]", is that considered ambiguous?
Probably not. They both should be indexed using KANJI(ど).
> If it's indexed to "ど[05]" and there only exists one "ど" entry with five senses, is that still considered ambiguous?
No. The sense number is really only for matching by the reporting app; not for finding the sentence pair (at least for WWWJDIC - I can't comment on other apps.)
> Since "易々{やすやす}" didn't get assigned to the correct entry, it seems safe to say that the information in curly braces is not used to disambiguate.
Correct. The part in {} is only to indicate the form in which the term appears in the sentence. In WWWJDIC it's used to enable the indexed term to be highlighted in the sentence display. I also use it in a validation program to verify the integrity of the indices.
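To make the roles of the different parts concrete, here is a minimal sketch of a parser for one index token. The grammar (headword, optional reading in parentheses, optional sense number in square brackets, optional surface form in curly braces, optional trailing `~`) is inferred only from the examples quoted in this thread. The names `IndexToken`, `parse_token` and `parse_index` are hypothetical, and this is not the parser used by WWWJDIC or any other app.

```python
import re
from typing import List, NamedTuple, Optional


class IndexToken(NamedTuple):
    headword: str            # dictionary form used as the lookup key
    reading: Optional[str]   # text inside (...), if present
    sense: Optional[int]     # number inside [...], if present
    surface: Optional[str]   # form as it appears in the sentence, inside {...}
    checked: bool            # True if the token ends with "~"


# Assumed token layout: headword(reading)[sense]{surface}~
TOKEN_RE = re.compile(
    r"^(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]+)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]+)\})?"
    r"(?P<checked>~)?$"
)


def parse_token(token: str) -> IndexToken:
    """Split one index token into its parts; raise on unexpected syntax."""
    m = TOKEN_RE.match(token)
    if m is None:
        raise ValueError(f"unrecognised index token: {token!r}")
    sense = m.group("sense")
    return IndexToken(
        headword=m.group("headword"),
        reading=m.group("reading"),
        sense=int(sense) if sense else None,
        surface=m.group("surface"),
        checked=m.group("checked") is not None,
    )


def parse_index(index_string: str) -> List[IndexToken]:
    """Parse a whole index string; tokens are separated by whitespace."""
    return [parse_token(t) for t in index_string.split()]
```

For example, `parse_token("解ける(とける)[02]{とけた}~")` yields the headword 解ける, the reading とける, sense 2, the surface form とけた, and the checked flag set. Only the headword and the reading would take part in entry lookup under the behaviour described above.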
OK, so if the index parser is simple and doesn't use all the information available to find the correct entry, then by my count there are 572 instances of keys within index strings that we can consider to be ambiguous.
We can probably use the extra available information (sense numbers, readings within curly braces, "usually kana" info) to fix a couple hundred of these instances automatically. If I were to provide a list of sentence IDs, index strings, and fixed index strings, would we be able to run a bulk update (find-and-replace) on the index database?
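As an illustration of what such a bulk update could look like, here is a small sketch that applies a list of (sentence ID, old index string, new index string) rows to an in-memory mapping. The function name `apply_fixes` and the dictionary standing in for the index database are assumptions made for this example; the real database and its update mechanism are not described in this thread. Each replacement is applied only when the stored string still equals the expected old string, so rows that have gone stale are skipped rather than overwritten.

```python
from typing import Dict, Iterable, List, Tuple

Fix = Tuple[int, str, str]  # (sentence ID, old index string, new index string)


def apply_fixes(index_db: Dict[int, str], fixes: Iterable[Fix]) -> Tuple[List[int], List[int]]:
    """Apply exact-match replacements in place; return (applied IDs, skipped IDs)."""
    applied: List[int] = []
    skipped: List[int] = []
    for sent_id, old, new in fixes:
        # Only replace when the stored string is exactly what the fix expects.
        if index_db.get(sent_id) == old:
            index_db[sent_id] = new
            applied.append(sent_id)
        else:
            skipped.append(sent_id)
    return applied, skipped


# Example row based on sentence 142850, where 鯨 is qualified as 鯨(くじら).
db = {142850: "生きる{生きている} 鯨{クジラ}~ を 見る{見た} 事がある{ことがある}"}
fixes = [(
    142850,
    "生きる{生きている} 鯨{クジラ}~ を 見る{見た} 事がある{ことがある}",
    "生きる{生きている} 鯨(くじら){クジラ}~ を 見る{見た} 事がある{ことがある}",
)]
applied, skipped = apply_fixes(db, fixes)
```

Running the same list a second time skips every row, because the stored strings no longer match the old values.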
By the way, out of those 572 instances, only 8 contain readings in parentheses. We'll need to replace these readings with explicit sequence numbers since the kanji-reading pairs all belong to more than one entry. (In the case of 家・うち, the pair is in entry 1457730 as a search-only form.)
Sent_ID | Index String | Ambiguous Key | Entries |
---|---|---|---|
74073 | 靴紐{靴ひも} が 解ける(とける)[02]{とけた}~ | 解ける(とける)[02]{とけた}~ | 1546070;1198910 |
148483 | 朱(しゅ)[02]~ に 交わる{交われば} 赤い[01]{赤く} 成る[01]{なる} | 朱(しゅ)[02]~ | 2273400;2856794 |
174571 | 固体~ が 解ける(とける)[04]~ と 液体 になる[01] | 解ける(とける)[04]~ | 1546070;1198910 |
187003 | 家(うち)[01]~ へ 着く(つく)[01]{ついたら} 電話 が 鳴る{鳴っていた} | 家(うち)[01]~ | 1457730;1191740 |
191733 | 我々{われわれ} の 全て{すべて} が 生まれつき 音楽 の 才(さい)[01]~ が[01] 有る{ある} 訳ではない{わけではない} | 才(さい)[01]~ | 1294940;2835063 |
213327 | 其の[01]{その} お陰で{おかげで} 誤解 が 解ける(とける)[03]~ | 解ける(とける)[03]~ | 1546070;1198910 |
228246 | 家(うち)[01]{うち}~ には 十 頭(とう) の 牛[01] が 居る(いる)[01]{いる} | 家(うち)[01]{うち}~ | 1457730;1191740 |
229051 | 一番(いちばん)[01]{いちばん} 頭(あたま)[02] の 良い 生徒 でさえ[01] 其の[01]{その} 問題 は 解ける(とける)[01]{解けなかった}~ | 解ける(とける)[01]{解けなかった}~ | 1546070;1198910 |
> OK, so if the index parser is simple and doesn't use all the information available to find the correct entry, then by my count there are 572 instances of keys within index strings that we can consider to be ambiguous.
I don't think there are 572 ambiguous instances. If we look at the first line in your table:
Yes, 解ける is found in two entries (1546070;1198910) but it is only the leading kanji form in 1198910. In 1546070 the leading kanji form is 溶ける and sentences are linked to that form. You can verify this by looking up 解ける in WWWJDIC.
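The rule described here can be sketched as follows. This is only a reading of the comment above, not WWWJDIC's actual code: the function name `resolve_by_leading_form` is hypothetical, and the form lists are reduced to the forms mentioned in this thread (the real entries may contain more).

```python
from typing import Dict, List


def resolve_by_leading_form(key: str, entries: Dict[int, List[str]]) -> List[int]:
    """Return the entries a bare kanji key would link to.

    Prefer entries whose first (leading) kanji form equals the key;
    fall back to every entry containing the key if none leads with it.
    """
    containing = [seq for seq, forms in entries.items() if key in forms]
    leading = [seq for seq in containing if entries[seq][0] == key]
    return leading or containing


# Kanji forms per entry, leading form first (partial, taken from this thread).
entries = {
    1546070: ["溶ける", "解ける"],
    1198910: ["解ける"],
}

candidates = resolve_by_leading_form("解ける", entries)
```

Under this rule 解ける resolves to 1198910 alone, while a key that leads in more than one entry, or in none, would still come back with several candidates.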
Finding the actual ambiguous instances is tricky. I don't think the sense numbers and the written forms in curly braces are actually much use for that.
It's the same with 家(うち) and 才(さい) - there are actually no ambiguities in the sentence linking. 朱(しゅ) was ambiguous so I replaced the reading with the entry number.
> Yes, 解ける is found in two entries (1546070;1198910) but it is only the leading kanji form in 1198910.
This "leading kanji form" method doesn't seem to work reliably. For example, sentence #142850 has the following index string.
生きる{生きている} 鯨{クジラ}~ を 見る{見た} 事がある{ことがある}
In the latest version of the `JMdict_e_examp` file, this sentence is found in entry 2846155 (いさな) rather than entry 1253270 (くじら), even though 鯨 is the leading form only in the latter.
> I don't think the sense numbers and the written forms in curly braces are actually much use for that.
In the example above, the クジラ in curly braces shows that the sentence belongs to the くじら entry rather than いさな. I'm only suggesting that the info could be a useful heuristic for spotting these incorrectly indexed sentences.
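A rough sketch of that heuristic: normalise the form in curly braces from katakana to hiragana and check whether it equals a reading of exactly one candidate entry. The function names and the candidate table are hypothetical, and the readings listed are only the ones mentioned in this thread. The check only helps when the surface form is an uninflected kana spelling, so an inflected form such as とけた matches nothing and yields no suggestion.

```python
from typing import Dict, List, Optional


def kata_to_hira(text: str) -> str:
    """Map katakana in U+30A1..U+30F6 to hiragana; leave other characters alone."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c
        for c in text
    )


def suggest_entry(surface: str, candidates: Dict[int, List[str]]) -> Optional[int]:
    """Return the single candidate whose readings include the surface form.

    Returns None when no candidate, or more than one candidate, matches.
    """
    norm = kata_to_hira(surface)
    matches = [seq for seq, readings in candidates.items() if norm in readings]
    return matches[0] if len(matches) == 1 else None


# Candidate entries for the key 鯨, with the readings named in this thread.
kujira_candidates = {1253270: ["くじら"], 2846155: ["いさな"]}
suggestion = suggest_entry("クジラ", kujira_candidates)
```

Here `suggestion` is 1253270, which differs from the entry the sentence is currently filed under (2846155), so the sentence would be flagged for review.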
I've been meaning to get around to doing some more work on this issue. I might have some more progress to share soon.
Yes, the arrival of that 鯨/いさな entry meant that it competes with 鯨/くじら for which one gets linked. I've now amended the 鯨 links to 鯨(くじら) which should fix it.
If I get the time and energy I should check the unqualified kanji indices in the sentences for cases where there are multiple dictionary entries with the same form. Fixing them can be a problem - Tatoeba's global edit is very good, but it has a problem with cases where the index form is at the start of the sentence.
I set up a repo here on GitHub to track my edits to the Tatoeba 'sentence annotations' database. It is synchronized once a week with the `examples.utf` file that is published on the EDRDG's FTP server.
https://github.com/stephenmk/jmdict-tatoeba-sentence-linking/commits/main/
My fixes to problematic index strings will be visible here as I work on this issue over time.
Occasionally I come across example sentences that are keyed to the incorrect entry because the key has been defined ambiguously. See this report about 易々 here for example. The index string contains `易々{やすやす}~` instead of `易々(やすやす){やすやす}~`, and consequently the sentence ends up in the entry for いい【易々】 in the `JMdict_e_examp` file.
I'd like to fix all of these errors at once, so I tried to search for priority-tagged keys (i.e., keys with the `~` symbol appended) which could be considered ambiguous. Unfortunately there seem to be at least several hundred. The precise number depends upon how we define ambiguity.
Even if we assume that the key "ど" belongs to "ど[nokanji]" and also make use of the sense number information, by my count there are still 331 ambiguous sentences. I posted a CSV file with the data here. Working through this list would be quite a challenge.