Doublevil / JmdictFurigana

A Japanese dictionary resource that attaches furigana to individual words

The JmdictFurigana project

Download the latest release of the JmdictFurigana file.

What is it?

This project aims to build an open-source furigana resource to complement the EDICT/Jmdict and ENAMDICT/Jmnedict dictionary files. It links the kanji reading of each dictionary word to its kana reading by attaching each portion of the kana to the kanji characters it belongs to.

Concretely, if you are building an application with the EDICT/Jmdict file, you can use the output of this project to display pretty furigana over your words instead of a plain kana string.

What it is NOT

JmdictFurigana is not a lexical parser. It is designed for individual words, not for sentences.

In other words, where a lexical parser identifies words in a sentence or an expression, JmdictFurigana identifies individual kanji readings within a word.

As such, using it in tools that provide furigana over entire sentences is discouraged.

For newcomers

The EDICT (or Jmdict) and ENAMDICT (or Jmnedict) files are Japanese word dictionary files that contain, for each entry, a kanji reading and a kana reading, among other information.

Our goal is to attach the right parts of the kana reading to the right kanji in the kanji reading.


How can I use it?

Download the latest release of the furigana files.

A new release is built automatically on the 25th of every month through GitHub Actions, with updated dictionary files. As Jmdict keeps evolving, so does JmdictFurigana.

In the latest release, there are two sets of files you can use: the JSON files, or the compact plain text format.

Note that the JSON files are also available in compressed formats (both .zip and .tar.gz) for lighter downloads.

How to use the JSON files

There are two files you can use: one for Jmdict (JmdictFurigana.json) and one for Jmnedict.

Please note that the JSON files available in the releases are compressed using gzip (hence the .gz file extension), because they are very large. You may need a third-party utility to decompress them.
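If you would rather not decompress the files by hand, they can be read directly from code. The following is a minimal Python sketch, assuming the file is a single gzip-compressed JSON document encoded as UTF-8. The function name load_furigana_entries and the path you pass to it are illustrative and not part of this project.

```python
import gzip
import json


def load_furigana_entries(path):
    """Load a gzip-compressed JSON array of furigana entries (hypothetical helper)."""
    # "utf-8-sig" reads plain UTF-8 and also skips a leading byte order mark,
    # in case the file starts with one (an assumption, not a documented fact).
    with gzip.open(path, mode="rt", encoding="utf-8-sig") as handle:
        return json.load(handle)
```

The whole array is loaded into memory at once, which is simple but may be heavy for files this large.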

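Both files are formatted in exactly the same way: each is a JSON array of entry objects with three fields: text (the kanji reading), reading (the kana reading), and furigana (an array of parts, each with a ruby string and, where furigana applies, an rt string).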

Example JSON entry

Here is an example entry from the JmdictFurigana.json file:

{
  "text": "大人買い",
  "reading": "おとながい",
  "furigana": [
    {
      "ruby": "大人",
      "rt": "おとな"
    }, {
      "ruby": "買",
      "rt": "が"
    }, {
      "ruby": "い"
    }
  ]
}

In this example, the word is 大人買い, read as おとながい, and the furigana array breaks it down into 3 parts: 大人, read as おとな; 買, read as が; and い, which is already kana and carries no furigana.

Note: In this example, the expression "大人" uses a special reading: "おとな". This reading cannot be split into お and とな, or into おと and な. This is why the "おとな" furigana applies to the whole expression.
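To show how an entry like the one above might be displayed, here is a small Python sketch that turns the furigana array into HTML ruby markup. The function name furigana_to_html is hypothetical, and the sketch assumes that parts without an rt field are meant to be shown as plain text.

```python
import html


def furigana_to_html(parts):
    """Render a furigana array as HTML ruby markup (illustrative helper)."""
    pieces = []
    for part in parts:
        base = html.escape(part["ruby"])
        reading = part.get("rt")
        if reading:
            pieces.append(f"<ruby>{base}<rt>{html.escape(reading)}</rt></ruby>")
        else:
            # Parts without "rt" (plain kana here) are emitted as-is.
            pieces.append(base)
    return "".join(pieces)


entry = {
    "text": "大人買い",
    "reading": "おとながい",
    "furigana": [
        {"ruby": "大人", "rt": "おとな"},
        {"ruby": "買", "rt": "が"},
        {"ruby": "い"},
    ],
}
print(furigana_to_html(entry["furigana"]))
# <ruby>大人<rt>おとな</rt></ruby><ruby>買<rt>が</rt></ruby>い
```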

How to use the plain text format

This format is historical but will remain supported, because it is compact. You might want to use the JSON files instead, as they are probably easier to parse.

There are two files you can use, one for Jmdict and one for Jmnedict.

Both files are text files containing lines of data following this format:

<kanji reading>|<kana reading>|<furigana string>

The <furigana string> itself consists of chains of the following pattern, separated by ';':

<startIndex>(-<endIndex>):<kana string>

Indexes give the position, in the kanji reading, of the characters to which the kana string is attached. As the examples show, positions start at 0 and the end index is inclusive. If the end index is not specified, the kana string applies only to the character at the start index.

Let's take some examples:

頑張る|がんばる|0:がん;1:ば

大人買い|おとながい|0-1:おとな;2:が

Note: In this last example, the expression "大人" uses a special reading: "おとな". This reading cannot be split into お and とな, or into おと and na. This is why our "おとな" furigana applies to the whole expression.
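The format described above can be parsed in a few lines. The following Python sketch is one possible reading of it, assuming that indexes are zero-based character positions and that the end index is inclusive, as the two examples suggest. The names parse_furigana_line and to_parts are illustrative only.

```python
def parse_furigana_line(line):
    """Split '<kanji>|<kana>|<furigana string>' into (kanji, kana, segments)."""
    kanji, kana, furigana = line.rstrip("\n").split("|", 2)
    segments = []
    for chunk in furigana.split(";"):
        if not chunk:
            continue  # tolerate an empty furigana string or a trailing ';'
        index_part, reading = chunk.split(":", 1)
        start_text, _, end_text = index_part.partition("-")
        start = int(start_text)
        # Without an end index, the kana applies to a single character.
        end = int(end_text) if end_text else start
        segments.append((start, end, reading))
    return kanji, kana, segments


def to_parts(kanji, segments):
    """Rebuild parts shaped like the JSON furigana array from parsed segments."""
    parts, position = [], 0
    for start, end, reading in sorted(segments):
        if start > position:
            parts.append({"ruby": kanji[position:start]})
        parts.append({"ruby": kanji[start:end + 1], "rt": reading})
        position = end + 1
    if position < len(kanji):
        parts.append({"ruby": kanji[position:]})
    return parts


kanji, kana, segments = parse_furigana_line("大人買い|おとながい|0-1:おとな;2:が")
print(segments)
# [(0, 1, 'おとな'), (2, 2, 'が')]
```

Converting the segments with to_parts gives the same three parts as the JSON example for 大人買い.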

How does it work?

The solver that works out which kanji matches which kana string uses multiple algorithms, each of which may solve specific cases. The main algorithm uses the kanji readings from the kanjidic files. It walks through the kanji reading and recursively tries to match the kana string using all possible combinations of readings. This does not always work, because of special readings, missing readings and other oddities.
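The recursive matching idea can be illustrated with a short Python sketch. This is not the project's actual solver: the function solve and the TOY_READINGS table are hypothetical, and the table is a tiny hand-written stand-in for the readings that would come from kanjidic.

```python
# Toy reading table, hand-written for illustration (not taken from kanjidic).
TOY_READINGS = {
    "頑": ["がん"],
    "張": ["は", "ば", "ちょう"],
    "大": ["おお", "だい"],
    "人": ["ひと", "じん"],
}


def solve(word, kana, readings):
    """Return every way to split `kana` across the characters of `word`."""
    if not word:
        # Success only if the kana string has been fully consumed as well.
        return [[]] if not kana else []
    head, rest = word[0], word[1:]
    # A character with no known reading (such as kana) must match itself.
    candidates = readings.get(head, [head])
    solutions = []
    for candidate in candidates:
        if candidate and kana.startswith(candidate):
            for tail in solve(rest, kana[len(candidate):], readings):
                solutions.append([(head, candidate)] + tail)
    return solutions


print(solve("頑張る", "がんばる", TOY_READINGS))
# [[('頑', 'がん'), ('張', 'ば'), ('る', 'る')]]
print(solve("大人", "おとな", TOY_READINGS))
# []
```

The second call finds nothing because おとな is a special reading of the whole expression, which is the kind of case that per-kanji matching cannot handle.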

Other algorithms can solve entries with a kanji reading that contains only one kanji, entries where there are no consecutive kanji, and other specific cases.

These algorithms are run one after another and each returns the solutions it found, if any. In the end, if there is only one solution, or if all solutions are equivalent, the single solution is retained.
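The selection rule can be sketched as follows in Python. This is a simplified illustration that treats "equivalent" as plain equality, which may be stricter than what the project does. The function name pick_solution is hypothetical.

```python
def pick_solution(results_per_algorithm):
    """Keep a solution only when every solution found is the same one."""
    found = [
        solution
        for solutions in results_per_algorithm
        for solution in solutions
    ]
    if not found:
        return None  # no algorithm solved the entry
    first = found[0]
    # Assumption: equivalence is modelled as simple equality here.
    if all(solution == first for solution in found):
        return first
    return None  # conflicting solutions, so the entry stays unsolved
```

Under this sketch, an entry for which two algorithms disagree is left unsolved rather than guessed.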

There are also lists that contain exceptions and special readings. These lists are filled manually and will probably never be complete, given the massive amount of work that it represents.

The latest release of the furigana file for Jmdict was built in about two minutes and solved 177770 entries out of 234814 (keep in mind that many entries cannot be "solved" at all because they do not contain kanji).

The latest Jmnedict file solved 584141 out of 741346 entries in about 3 minutes.

Reliability

While results are not 100% accurate, they are verified with an algorithm that checks that no kanji is left without furigana and that the expression reads correctly.
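A check of this kind could look like the following Python sketch, written against the JSON entry format. It is an approximation rather than the project's own verification: it assumes that any part without an rt field must consist only of kana, and it does not handle cases such as katakana written in the word but hiragana in the reading. The names is_kana and check_entry are hypothetical.

```python
def is_kana(char):
    # Hiragana and katakana blocks (U+3040 to U+30FF), long-vowel mark included.
    return "\u3040" <= char <= "\u30ff"


def check_entry(entry):
    """Return True when the furigana parts are consistent with the entry."""
    parts = entry["furigana"]
    # The parts must rebuild the written form exactly.
    if "".join(part["ruby"] for part in parts) != entry["text"]:
        return False
    read_back = []
    for part in parts:
        if "rt" in part:
            read_back.append(part["rt"])
        elif all(is_kana(char) for char in part["ruby"]):
            read_back.append(part["ruby"])
        else:
            return False  # a non-kana character was left without furigana
    # Reading the parts in order must give back the kana reading.
    return "".join(read_back) == entry["reading"]
```

Applied to the 大人買い entry shown earlier, this check passes, while dropping the rt field from the 買 part would make it fail.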

I am aware of an issue where certain special expressions are split incorrectly because of the same-length algorithm. I consider these cases minor in both number and importance.

The JmnedictFurigana file is less accurate, because proper names most often use special readings, unusual contractions and the like.

Running the solution

The solution is missing the ./JmdictFurigana/Resources/JMnedict.xml file because it is too big to commit here. You can download it from the ENAMDICT/Jmnedict project page.

Contribution and contact

If you have any questions or remarks regarding the project, or want to report errors, don't hesitate to file an issue or contact me through GitHub.

If you notice an error with a special expression, you can also contribute directly by editing the SpecialExpressions.txt file.

Licence

This resource is distributed under the same licence as JMDict (Creative Commons Attribution-ShareAlike Licence).

Release notes

2.3.1 (2024-04-29):

2.3 (2020-09-24):

2.2 (2020-08-22):

2.1 (2019-07-24):

2.0 (2017-07-16):

1.4 (2016-11-13):

1.3 (2016-08-21):

1.2 (2016-04-10):

1.1 (2016-03-26):