FooSoft / yomichan-import

External dictionary importer for Yomichan.
https://foosoft.net/projects/yomichan-import/
MIT License

add epwing support for kotowaza #4

Closed ghost closed 7 years ago

ghost commented 7 years ago

Introduction

This is a dictionary I have wanted to use for a while, but it worked with neither rikaisama nor an older desktop epwing reader application. With the exception of a very small number of cases noted below, all the useful information contained in the epwing can be extracted.

This dictionary is apparently still being produced and sold by 三省堂 as part of the 新明解 dictionaries; however, the title contained in the epwing merely says 故事ことわざの辞典. This epwing may be a much older revision that predates the 新明解 label.

Regexes

Almost all of the headings are remarkably clean. The heading text is simply the proverb or idiom, with a reading for every word.

However, there are headings which contain alternate forms.

The approach taken in the implementation of the extractor is to determine all reduced forms of the expression. A reduced form of an expression is one which has no word alternatives. Then for each reduced form, all possible readings are determined for that particular form.

The following is an example parse of a simple case where there are three possible variations of a proverb.

-- 今参(いままい)り=二十日(はつか)〔=百日(ひゃくにち)・三日(みっか)〕 -> 今参(いままい)り二十日(はつか) or 今参(いままい)り百日(ひゃくにち) or 今参(いままい)り三日(みっか)
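The expansion into reduced forms can be sketched roughly as follows. This is only an illustration in Python, not the code used by the importer: the `GROUP` pattern and the `reduced_forms` name are assumptions, and the pattern assumes that a primary word runs from `=` up to the opening `〔=`, with alternatives separated by `・` up to the closing `〕`. Both half-width and full-width `=` are accepted because it is not certain which one the epwing uses.

```python
import itertools
import re

# Assumed shape of a word-alternative group:
#   '=' primary word '〔=' alternative '・' alternative ... '〕'
GROUP = re.compile(r"[==]([^==〔〕]+)〔[==]([^〔〕]+)〕")


def reduced_forms(heading):
    """Return every form of the heading with word alternatives resolved."""
    parts = []  # each item lists the options for one segment of the heading
    pos = 0
    for m in GROUP.finditer(heading):
        parts.append([heading[pos:m.start()]])               # fixed text before the group
        parts.append([m.group(1)] + m.group(2).split("・"))  # primary word + alternatives
        pos = m.end()
    parts.append([heading[pos:]])                            # fixed text after the last group
    return ["".join(combo) for combo in itertools.product(*parts)]


forms = reduced_forms("今参(いままい)り=二十日(はつか)〔=百日(ひゃくにち)・三日(みっか)〕")
# forms == ["今参(いままい)り二十日(はつか)",
#           "今参(いままい)り百日(ひゃくにち)",
#           "今参(いままい)り三日(みっか)"]
```

Splitting the bracketed text on every `・` is deliberately naive: it would break on an alternative whose reading itself lists several readings, which matches the decision below to leave that case unhandled.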

Below is the information about possible scenarios regarding alternate forms.

Word alternatives are indicated by the = character. Every = with non-bracketed text that follows is paired with alternatives enclosed by 〔= and 〕. There are 886 headings with such alternatives.

There is 1 heading that is an exception: it contains the = indicator but has no alternatives.

A quick Google search for 勝地は主なし returns no exact matches; instead it turns up 勝地定主無し, which appears to have the same meaning. Nevertheless, this single exception is left unhandled so as not to unnecessarily complicate the regex.

Alternatives are also specified with the ・ character. This can denote alternatives for readings (the most common case).

There are 197 headings with reading alternatives enclosed by ().

There are at most 3 reading alternatives listed and this is the case for exactly 1 heading.
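As a rough sketch of how readings could be derived from a reduced form, the following Python splits a form into its expression and every possible reading. It is an illustration rather than the importer's actual code: the `ANNOTATED` pattern and the `expression_and_readings` name are assumptions, as is the rule that a parenthesized reading annotates the run of kanji directly before it. The second heading in the usage lines is made up to show reading alternatives and is not taken from the dictionary.

```python
import itertools
import re

# Assumption: a reading in parentheses annotates the kanji run right before it.
# Both half-width and full-width parentheses are accepted.
ANNOTATED = re.compile(r"([\u4e00-\u9fff々]+)[((]([^()()]+)[))]")


def expression_and_readings(form):
    """Split a reduced form into its expression and all possible readings."""
    # The expression keeps the kanji and drops the parenthesized readings.
    expression = ANNOTATED.sub(lambda m: m.group(1), form)

    parts = []  # each item lists the options for one segment of the reading
    pos = 0
    for m in ANNOTATED.finditer(form):
        parts.append([form[pos:m.start()]])   # kana between annotated words
        parts.append(m.group(2).split("・"))  # one or more readings for the kanji
        pos = m.end()
    parts.append([form[pos:]])
    readings = ["".join(combo) for combo in itertools.product(*parts)]
    return expression, readings


print(expression_and_readings("今参(いままい)り二十日(はつか)"))
# ('今参り二十日', ['いままいりはつか'])
print(expression_and_readings("明日(あす・あした)は雨(あめ)"))  # made-up heading
# ('明日は雨', ['あすはあめ', 'あしたはあめ'])
```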

The ・ character can also denote alternatives for words enclosed by 〔= and 〕.

There are 26 headings with additional word alternatives denoted by the ・ character.

There are at most 3 additional word alternatives denoted by the ・ character, and this is the case for exactly 3 headings.

There are exactly 9 headings where for a group of words (the primary and its alternatives), the primary word has more than one reading.

There are exactly 2 headings which contain word alternatives which also contain reading alternatives.

This case is left unhandled so as not to unnecessarily complicate the regex.
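If it were useful to at least detect these headings so they can be skipped or logged, a check along the following lines might work. This is a hypothetical sketch in Python, not part of the submitted extractor; the function name and both patterns are assumptions, and the first heading in the usage lines is invented for illustration.

```python
import re

# Text between '〔=' and '〕' (half-width or full-width '=' accepted).
BRACKET = re.compile(r"〔[==]([^〔〕]+)〕")
# A parenthesized reading that itself lists alternatives separated by '・'.
READING_WITH_ALT = re.compile(r"[((][^()()]*・[^()()]*[))]")


def has_nested_reading_alternatives(heading):
    """True if any bracketed word alternative carries more than one reading."""
    return any(
        READING_WITH_ALT.search(m.group(1)) is not None
        for m in BRACKET.finditer(heading)
    )


print(has_nested_reading_alternatives("=甲(こう)〔=乙(おつ・いつ)〕"))  # invented heading
# True
print(has_nested_reading_alternatives("今参(いままい)り=二十日(はつか)〔=百日(ひゃくにち)・三日(みっか)〕"))
# False
```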

There are exactly 2 headings with word alternatives where readings do not immediately follow a kanji.

This case is left unhandled so as not to unnecessarily complicate the regex.

Glyph Tables

Thankfully, none of the bitmap glyphs are ever referenced in the dictionary entries.

Tags

Putting aside the fact that there is no grammatical metadata included in the entries, I do not think it is necessarily a good idea to try and apply deinflection on proverbs. In any case, there is no feasible way to determine grammatical metadata for the entries.

FooSoft commented 7 years ago

Excellent work! I was pretty interested in this dictionary as well, but didn't have the time to hook it up 👍 Thanks again for the code and the clear explanation!