FooSoft / yomichan-import

External dictionary importer for Yomichan.
https://foosoft.net/projects/yomichan-import/
MIT License

add epwing support for meikyou #2

Closed ghost closed 7 years ago

ghost commented 7 years ago

Introduction

I use the Meikyou EPWING dictionary in Firefox with Rikaisama, but after more than a decade I am looking to move from Firefox to Chrome, so I needed a replacement for Rikaisama, and Yomichan is basically the best option. As far as I can tell, the dictionary now works very well in Yomichan, so I think this is a good time to make the pull request. Below are notes I took that will (I hope) help you verify that my work meets your standards (also, check your koohii PMs). Also, I seriously don't know how you managed to put yourself through the bitmap glyph mappings for the daijirin extractor.

Also I am relatively inexperienced with the licensing stuff so I didn't know what to actually put at the top of the meikyou.go file I added; I will leave that to you.

Regexes

Normal expressions

These are enclosed in 【】.

Foreign expressions

These are enclosed in [] and contain a word, optionally followed by its country of origin. No heading contains both foreign expressions and normal expressions. There are 5431 foreign expressions.

Readings and Other expressions

Normally the reading of the expression(s) precedes 【】 in the heading. However, other expressions are sometimes placed there as well. Most other expressions look like an expression you would normally find enclosed in 【】, but there are 6 "special" other expressions enclosed in bracket characters that do not appear anywhere else: 〔小さい〕, 〔大きい〕, [夏], [秋], [春], [冬].

These are all obviously very common words, so I do not think they are worth addressing. Whether a match is a reading or an other expression can be determined by checking whether we found any expressions normally (the same way it is handled for daijirin).
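The heading rules above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from meikyou.go: the `heading` type, `parseHeading`, and the sample headings in `main` are hypothetical, and the sketch assumes at most one bracketed group per heading.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Normal expressions are enclosed in 【】.
	normalPtrn = regexp.MustCompile(`【([^】]*)】`)
	// Foreign expressions are enclosed in [].
	foreignPtrn = regexp.MustCompile(`\[([^\]]*)\]`)
)

// heading is a hypothetical container for the parts of one heading.
type heading struct {
	Leading string // text before the bracketed part
	Normal  string // contents of 【】, if any
	Foreign string // contents of [], if any
}

// LeadingIsReading reports whether the leading text is a reading.
// It is a reading only when a normal expression was found; otherwise
// it is treated as an "other expression".
func (h heading) LeadingIsReading() bool {
	return h.Normal != ""
}

// parseHeading splits a heading into its leading text and bracketed part.
func parseHeading(s string) heading {
	if loc := normalPtrn.FindStringSubmatchIndex(s); loc != nil {
		return heading{
			Leading: strings.TrimSpace(s[:loc[0]]),
			Normal:  s[loc[2]:loc[3]],
		}
	}
	if loc := foreignPtrn.FindStringSubmatchIndex(s); loc != nil {
		return heading{
			Leading: strings.TrimSpace(s[:loc[0]]),
			Foreign: s[loc[2]:loc[3]],
		}
	}
	return heading{Leading: strings.TrimSpace(s)}
}

func main() {
	// Made-up sample headings, for illustration only.
	for _, s := range []string{"あい【愛】", "アイス[ice]", "あいうえお"} {
		h := parseHeading(s)
		fmt.Printf("%q: %+v reading=%v\n", s, h, h.LeadingIsReading())
	}
}
```

Note that the six special bracketed forms such as [夏] would match the foreign pattern here; as stated above, they are not handled specially.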

Tags

Tags are wrapped in 〘〙, which are wide bitmap glyphs 45118 and 45119 respectively in the actual Meikyou EPWING. Tags are separated by . When more than one 〘〙-enclosed set of tags exists in the text field, the sets are on different lines. I am assuming that the convention is to add rules that correspond to the EDICT tags, so I have tried to stick to that as much as possible. That said, the exportRules code is a bit messy because it does not use regexes as the daijirin and daijisen extractors do. I am not sure I understand the use of the rules (my guess is that they help yomichan's deinflector); if the only rules that matter are "adj-i", "vs", "vk", "v5", and "v1", then the exportRules code I wrote can be simplified a lot.

These are all the unique tags in Meikyou: ニ,トニ,他,他上一,他下一,他下二,他五,他四,他サ変,代,副,副ト,副トニ,副助,助動,助動 下一型,助動 下二型,助動 五型,助動 四型,助動 ラ変型,助動 形動型,助動 形型,助動 特活型,動上一,動下一,動下二,動五,動四,動サ変,動特活,名,形,形ク,形シク,形動,形動ナリ,形動トタル,感,接,接助,接尾,接頭,格助,終助,自,自上一,自上二,自下一,自下二,自五,自他,自他上一,自他下一,自他五,自他サ変,自四,自サ変,補動,補動五,補動四,補形,連体,連語
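If only the deinflection-related rules matter, a simplified mapping from the tags listed above might look like the sketch below. It is not the exportRules code from the pull request: `ruleForTag` is a hypothetical helper, and the suffix-based matching is an assumption about which Meikyou tags correspond to which EDICT-style rule. No tag in the list obviously corresponds to "vk", so that rule is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// ruleForTag maps one Meikyou tag to an EDICT-style rule name.
// It returns "" when the tag has no matching rule.
// The suffix matching below is an assumption for illustration.
func ruleForTag(tag string) string {
	switch {
	case strings.HasSuffix(tag, "五"):
		return "v5" // e.g. 他五, 自五, 動五
	case strings.HasSuffix(tag, "上一"), strings.HasSuffix(tag, "下一"):
		return "v1" // e.g. 他上一, 自下一
	case strings.HasSuffix(tag, "サ変"):
		return "vs" // e.g. 自サ変, 自他サ変
	case tag == "形":
		return "adj-i"
	}
	return ""
}

func main() {
	for _, tag := range []string{"他五", "自他下一", "自サ変", "形", "形動", "名"} {
		fmt.Printf("%s -> %q\n", tag, ruleForTag(tag))
	}
}
```

Matching on the suffix keeps auxiliary tags such as 助動 五型 from being treated as verbs, since they end in 型 rather than 五.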

Glyph Tables

I manually created the tables based on the bitmap glyphs dumped from ebfont from your eb project repo on commit 6d0af07d883a239279d4984ce1785debabcf795d (which appears to still be the linked submodule in zero-epwing currently).

Below are several notes I recorded while creating the narrow and wide tables. Despite the fact that there are many unused characters, I spent a lot of time making sure that the tables are correct, and I've left notes below on particular glyphs where I thought they merited mention. I primarily used utf8-chartable.de, weblio.jp, kotobank.com, glyphwiki.org, and mdbg.net/chindict/chindict.php?page=radicals to hunt down obscure kanji and other weird characters in the wide table, and extended latin/greek/other characters in the narrow table.

I determined whether a glyph is unused by grepping for the inline markers in the output of the bundled zero-epwing binary (yomichan-import @ 816e9e605ea2079fd84dc2479f8f25565d463eda) and checking whether any dictionary entries came up. I didn't look too closely at the zero-epwing code, so in case it unintentionally filters out entries or text referring to those glyphs, I tried to determine the mappings anyway in case they become relevant in later versions of zero-epwing. Given how nonsensical this format is, I wouldn't be surprised if those glyphs just never got used, though.
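As a rough illustration of how such tables are applied, the Go sketch below replaces inline glyph markers with their mapped characters. The marker syntax `{{w_45118}}` / `{{n_41249}}` is an assumption about the zero-epwing output format, and `replaceGlyphs`, `wideTable`, and `narrowTable` are hypothetical names; only the two wide entries for 〘 and 〙 come from the notes above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// markerPtrn matches inline glyph markers; the exact marker syntax
// is an assumption about the zero-epwing output.
var markerPtrn = regexp.MustCompile(`\{\{([nw])_(\d+)\}\}`)

// Tiny stand-ins for the real tables. Only these two wide entries
// are taken from the notes above.
var (
	wideTable   = map[int]string{45118: "〘", 45119: "〙"}
	narrowTable = map[int]string{}
)

// replaceGlyphs substitutes every known marker with its character and
// leaves unknown markers untouched so they stay visible for debugging.
func replaceGlyphs(text string) string {
	return markerPtrn.ReplaceAllStringFunc(text, func(m string) string {
		parts := markerPtrn.FindStringSubmatch(m)
		code, err := strconv.Atoi(parts[2])
		if err != nil {
			return m
		}
		table := narrowTable
		if parts[1] == "w" {
			table = wideTable
		}
		if s, ok := table[code]; ok {
			return s
		}
		return m
	})
}

func main() {
	fmt.Println(replaceGlyphs("{{w_45118}}名{{w_45119}} {{n_41249}}"))
}
```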

Notes on the narrow table

Unused characters: 41249-41257, 41259-41289, 41291-41312, 41314, 41316, 41318-41319, 41325-41327, 41329-41331, 41333-41334, 41336-41340, 41342, 41505-41507, 41509-41583, 41585-41589, 41591-41597, 41598, 41761-41775, 41777-41830, 41841, 41848-41850, 42021.

Notes on the wide table

Unused characters: 45089-45094, 45110-45111, 45113-45114, 45133-45138, 45140-45141, 45149, 45345, 45376, 45378, 45388, 45418, 45431, 45637-45638, 45685-45687, 45689-45690, 45858, 45862, 45864.
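Lists like the two above could be cross-checked by scanning the zero-epwing output for inline markers and recording which codes actually occur; any table entry missing from the result is unused. The sketch below assumes the same `{{n_…}}` / `{{w_…}}` marker syntax (an assumption, not confirmed here), and `usedCodes` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// markerPtrn matches inline glyph markers; the syntax is assumed.
var markerPtrn = regexp.MustCompile(`\{\{([nw])_(\d+)\}\}`)

// usedCodes returns the sorted, de-duplicated narrow and wide glyph
// codes that appear in text.
func usedCodes(text string) (narrow, wide []int) {
	seen := map[string]map[int]bool{"n": {}, "w": {}}
	for _, m := range markerPtrn.FindAllStringSubmatch(text, -1) {
		code, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		seen[m[1]][code] = true
	}
	for c := range seen["n"] {
		narrow = append(narrow, c)
	}
	for c := range seen["w"] {
		wide = append(wide, c)
	}
	sort.Ints(narrow)
	sort.Ints(wide)
	return narrow, wide
}

func main() {
	// Made-up sample text, for illustration only.
	n, w := usedCodes("{{w_45118}}名{{w_45119}} {{n_41258}} {{w_45118}}")
	fmt.Println("narrow:", n)
	fmt.Println("wide:", w)
}
```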

FooSoft commented 7 years ago

Thank you for your excellent work. I can say without any hesitation that this is the cleanest, best documented pull request I have received in all my years of doing open source on GitHub 🥇 I'm sure that your efforts will be highly appreciated by everyone using Yomichan to better their understanding of Japanese. I keep on wanting to expand yomichan-import with support for other dictionaries, but it always has to be done at the expense of development time on the actual extension...

The bitmap glyph mappings are honestly a huge pain in the ass. I've managed to scrounge some tables from here and there, but they are not top quality (Daijirin and Daijisen tables have some errors). I've made changes to zero-epwing to dump out font glyph data in all available sizes, and I am planning on creating a simple OCR tool to build these tables automatically. Interestingly enough, there are no libraries that I have found that do a good job with Japanese character recognition; I'm hoping that I can get good results with my method (it will be run offline and the character tables will be hard-coded into the source files like they are now).

Regarding the copyright stuff, I'm not too fussy about it. The file is going to be MIT licensed like the rest of the project, and since you wrote everything in the file you will be credited as the author 👍