FooSoft / yomichan

Japanese pop-up dictionary extension for Chrome and Firefox.
https://foosoft.net/projects/yomichan

[Feature Request] Match 旧字体 新字体 許容字体 標準字体 interchangeably #1625

Open epistularum opened 3 years ago

epistularum commented 3 years ago

The idea is to enable matching between the different forms of a kanji, to make yomichan more useful for words that contain alternate kanji forms. The most annoying cases are kanji that aren't part of the 常用漢字 but are sometimes typed in their 拡張新字体 form and sometimes in their 正字体 form. Some dictionaries display only one of the two forms, but both are regularly used (although 正字体 is the more "correct" form).

Notable examples (正字体 - 許容字体/拡張新字体): 噓 - 嘘, 蟬 - 蝉, 繫 - 繋, 摑 - 掴, etc.

I've already compiled a list of all characters in comparison to each other: https://github.com/epistularum/jitai
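
As a rough illustration of what such matching could look like, here is a minimal Python sketch that folds 正字体 characters onto their 拡張新字体 counterparts before a lookup. The four pairs come from the examples above; the names `VARIANT_PAIRS` and `fold_variants` are hypothetical and are not part of yomichan or of the linked list.

```python
# Hypothetical table: 正字体 -> 拡張新字体, limited to the examples in this thread.
VARIANT_PAIRS = {
    "噓": "嘘",
    "蟬": "蝉",
    "繫": "繋",
    "摑": "掴",
}

# str.maketrans accepts a dict whose keys are single characters.
_FOLD_TABLE = str.maketrans(VARIANT_PAIRS)


def fold_variants(text: str) -> str:
    """Replace every known 正字体 character with its 拡張新字体 form."""
    return text.translate(_FOLD_TABLE)


def same_word(a: str, b: str) -> bool:
    """Treat two spellings as equal if they match after folding."""
    return fold_variants(a) == fold_variants(b)
```

Folding both the scanned text and the dictionary headwords to one form means each term is looked up once, rather than once per spelling.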

Thermospore commented 3 years ago

Currently yomichan has two ways I can think of to overcome this issue:

The first is to use merged mode (group related terms). In this mode you select a primary dictionary which acts as an index of all the forms of a word. When you do a search, Yomichan checks all your secondary dictionaries for all the forms of the word listed in the primary dictionary.

For example, my copy of 大辞林 only has 啞蟬, but my primary dictionary has both 啞蟬 and 唖蝉. So even if I search 唖蝉, the 大辞林 entry still gets pulled up. 啞蟬 was actually added to jmdict by yours truly, just for this purpose :)

jmdict is an ideal primary dictionary, since it already has a good database of alternate forms and it's very easy to contribute any additional forms you find (which I do often). You can also hide the English definitions if you don't want them.

There are definitely forms still missing from JMdict, but I think they could be added en masse. Given the list you compiled, I bet a script could be written to check the headwords for a bunch of JJ dictionaries to find any alternate forms JMdict is missing.
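
A minimal sketch of such a script, in Python. It assumes the headwords of each dictionary have already been extracted into plain sets of strings and that the character list is available as a one-to-one mapping. The function name, its parameters, and the sample data are all hypothetical, and no real JMdict or 大辞林 file format is parsed here.

```python
def missing_alternate_forms(jmdict_headwords, jj_headwords, variant_pairs):
    """Return JJ headwords that JMdict lacks but knows under another spelling.

    variant_pairs maps one character form to its counterpart (hypothetical
    stand-in for the compiled character list).
    """
    table = str.maketrans(variant_pairs)
    # Fold every JMdict headword so spellings can be compared form-insensitively.
    jmdict_folded = {word.translate(table) for word in jmdict_headwords}
    return {
        word
        for word in jj_headwords
        if word not in jmdict_headwords and word.translate(table) in jmdict_folded
    }


# Illustrative data only, not taken from any real dictionary dump.
sample_pairs = {"啞": "唖", "蟬": "蝉", "噓": "嘘"}
sample_jmdict = {"唖蝉", "嘘"}
sample_jj = {"啞蟬", "噓", "猫"}
candidates = missing_alternate_forms(sample_jmdict, sample_jj, sample_pairs)
```

The result is only a list of candidates; each one would still need a manual check before being submitted to JMdict.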

The second approach would be to use the regex replacements feature.

toasted-nutbread commented 3 years ago

I've actually considered doing something similar for replacing characters that are visually similar, which could have potentially helped with scanning text generated from OCR that has mistakes, but the issue is that it quickly balloons the number of text variants that need to be scanned.

The lists you shared seem to be smaller than what I originally used, so maybe it's not as much of an issue. Is it correct to assume that all kanji would be converted to the same form, or would it be possible that multiple forms would be used in the same term/compound? If it's the latter, that gets back into the territory of generating very many additional search terms.

https://github.com/siikamiika/similar-kanji
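
To make the growth concrete, here is a small Python sketch of the "mixed forms" case, where every character of a term may independently appear in any of its forms. The `FORMS` table and both function names are hypothetical, and this is not how yomichan generates its search variants.

```python
from itertools import product
from math import prod

# Hypothetical table: each character maps to all of its interchangeable forms.
FORMS = {
    "唖": ("唖", "啞"), "啞": ("唖", "啞"),
    "蝉": ("蝉", "蟬"), "蟬": ("蝉", "蟬"),
}


def expand_mixed(term, forms=FORMS):
    """List every spelling obtained by varying each character independently."""
    options = [forms.get(ch, (ch,)) for ch in term]
    return ["".join(combo) for combo in product(*options)]


def count_mixed(term, forms=FORMS):
    """Number of spellings expand_mixed would produce, without building them."""
    return prod(len(forms.get(ch, (ch,))) for ch in term)
```

With two forms per character, a term containing k affected characters yields 2^k spellings if forms can be mixed. If all characters are instead converted to the same form, only one extra spelling per target form is needed.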

epistularum commented 3 years ago

The list I made only takes into account forms that count as correct in the 日本漢字能力検定 (新字体 vs 旧字体 and 標準字体 vs 許容字体). Because of this, the list is very small and only covers characters that are actually used interchangeably, so it isn't really suitable for matching OCR errors.

Concerning the multiple-form issue, this occurs only a few times: 厩 廐|廏, 熙 煕|熈, 蕊 蘂|蕋, 闘 鬪|鬭, 弁 辨|瓣|辯
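
One way to handle these one-to-many cases is to fold every alternate onto a single form, which stays a many-to-one lookup and adds no extra search terms. The Python sketch below assumes the first character of each group is the target form and the `|`-separated characters are its alternates; that reading of the list, and the names `MULTI_FORM_GROUPS` and `build_fold_index`, are assumptions made for illustration.

```python
# Groups copied from the list above: target form, then "|"-separated alternates.
MULTI_FORM_GROUPS = """\
厩 廐|廏
熙 煕|熈
蕊 蘂|蕋
闘 鬪|鬭
弁 辨|瓣|辯"""


def build_fold_index(raw):
    """Map each alternate character to the first-column form of its group."""
    index = {}
    for line in raw.splitlines():
        target, alternates = line.split()
        for alt in alternates.split("|"):
            index[alt] = target
    return index


FOLD_INDEX = build_fold_index(MULTI_FORM_GROUPS)


def fold(text, index=FOLD_INDEX):
    """Replace every known alternate in text with its target form."""
    return "".join(index.get(ch, ch) for ch in text)
```

Going the other way, from 弁 back to 辨, 瓣, or 辯, is ambiguous, so expanding in that direction would bring back the multiple-search-term problem.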

https://www.kanken.or.jp/kanken/outline/degree/rating.html

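  1. Answers for levels 2 through 10 follow the Cabinet Notification 常用漢字表 (2010). However, answers written in 旧字体 are not accepted as correct.
  2. Answers for level 1 and pre-1 follow the 標準字体, 許容字体, and 旧字体一覧表 given in 『漢検要覧 1/準1級対応』 (published by the Japan Kanji Aptitude Testing Foundation).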