Doublevil / JmdictFurigana

A Japanese dictionary resource that attaches furigana to individual words

JSON? #11

Closed fasiha closed 5 years ago

fasiha commented 5 years ago

I love how compact the JmdictFurigana.txt file format is but it certainly is non-standard and not trivial to parse correctly. I wonder if you've thought about distributing the data as some kind of JSON?

I have a package that parses the compact JmdictFurigana.txt into a line-delimited JSON file, whose lines are for example:

{"text":"アカガエル科","reading":"アカガエルか","furigana":["アカガエル",{"ruby":"科","rt":"か"}]}
{"text":"給料明細","reading":"きゅうりょうめいさい","furigana":[{"ruby":"給","rt":"きゅう"},{"ruby":"料","rt":"りょう"},{"ruby":"明","rt":"めい"},{"ruby":"細","rt":"さい"}]}

Each line is valid JSON with the following schema (in TypeScript notation; the names declared with type, such as Ruby and Entry, never appear in the generated JSON):

type Ruby = {
  ruby: string,
  rt: string,
};

type Furigana = string|Ruby;

type Entry = {
  furigana: Furigana[],
  reading: string,
  text: string,
};

I use ruby/rt to match HTML.
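
As a rough sketch of how a consumer might use this schema, the TypeScript below parses line-delimited JSON and renders an entry as HTML ruby markup. The helper names parseLines and toHtml are hypothetical, made up for this illustration, and are not part of any package mentioned here. The sketch also assumes the text contains no HTML special characters, so it does no escaping.

```typescript
type Ruby = { ruby: string; rt: string };
type Furigana = string | Ruby;
type Entry = { furigana: Furigana[]; reading: string; text: string };

// Hypothetical helper: parse line-delimited JSON, skipping blank lines.
function parseLines(ldjson: string): Entry[] {
  return ldjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as Entry);
}

// Hypothetical helper: plain strings pass through unchanged,
// Ruby objects become <ruby>base<rt>reading</rt></ruby>.
// Assumption: no HTML escaping is needed for the input text.
function toHtml(entry: Entry): string {
  return entry.furigana
    .map((f) => (typeof f === "string" ? f : `<ruby>${f.ruby}<rt>${f.rt}</rt></ruby>`))
    .join("");
}

const sample =
  '{"text":"アカガエル科","reading":"アカガエルか","furigana":["アカガエル",{"ruby":"科","rt":"か"}]}\n';
const entries = parseLines(sample);
console.log(toHtml(entries[0]));
// prints: アカガエル<ruby>科<rt>か</rt></ruby>
```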

This particular line-delimited JSON format expands the 8.7 MB original to 24 MB, but gzip compression means they're 2.5 MB versus 3.8 MB respectively over the wire. I can imagine replacing the Entry schema above with a simpler array-based one, something like type Entry = [string, string, Furigana[]] if we wanted to reduce filesize, or imitate the current JmdictFurigana.txt format.

Feel free to say no if you've thought about this and didn't want to support it! I've parsed the file in three languages now (JavaScript, Clojure, and again TypeScript/JavaScript) and it's sufficiently tricky that I thought I'd ask. Thank you!

Doublevil commented 5 years ago

I had thought about it but deemed it too wasteful in terms of file size. Seeing how well it compresses, and looking back on it overall, I would say I did not make the right decision. I will be looking into distributing it in both formats, with the schema you used in your package, if that's okay.

fasiha commented 5 years ago

Wow, thank you, that would be most kind!

I looked at an array-based encoding for JSON that might be a bit more compact, e.g.,

["アカガエル科","アカガエルか",["アカガエル",{"ruby":"科","rt":"か"}]]
["給料明細","きゅうりょうめいさい",[{"ruby":"給","rt":"きゅう"},{"ruby":"料","rt":"りょう"},{"ruby":"明","rt":"めい"},{"ruby":"細","rt":"さい"}]]

and the resulting line-delimited JSON is 19 MB and gzips to 3.6 MB. I'm not sure if such a small bit of space-saving is worth going from the fully expressive key-value format I described initially (24 MB → 3.8 MB) to a slightly more obfuscated array format. However, this array-based format may be preferable if you don't want to think of generic/multilingual names for the three fields, i.e., what I called text, reading, and furigana above (I've been told naming things is one of the two hardest problems in computer science, along with cache coherency and off-by-one-errors).
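
To make the trade-off concrete, here is a minimal sketch of converting between the keyed form and the array form. It assumes the array order is text, reading, furigana, as in the two example lines above. The names ArrayEntry, toArrayEntry, and fromArrayEntry are hypothetical and exist only for this illustration.

```typescript
type Ruby = { ruby: string; rt: string };
type Furigana = string | Ruby;
type Entry = { furigana: Furigana[]; reading: string; text: string };

// Hypothetical array form: [text, reading, furigana].
type ArrayEntry = [string, string, Furigana[]];

// Drop the key names; position alone identifies each field.
function toArrayEntry(e: Entry): ArrayEntry {
  return [e.text, e.reading, e.furigana];
}

// Restore the key names from the positions.
function fromArrayEntry([text, reading, furigana]: ArrayEntry): Entry {
  return { text, reading, furigana };
}

const keyed: Entry = {
  text: "アカガエル科",
  reading: "アカガエルか",
  furigana: ["アカガエル", { ruby: "科", rt: "か" }],
};

const compact = JSON.stringify(toArrayEntry(keyed));
console.log(compact);
// prints: ["アカガエル科","アカガエルか",["アカガエル",{"ruby":"科","rt":"か"}]]

const restored = fromArrayEntry(JSON.parse(compact) as ArrayEntry);
console.log(restored.text === keyed.text && restored.reading === keyed.reading);
```

The conversion is lossless in both directions, so the choice between the two forms is about readability and file size rather than information content.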

Again, many thanks for your hard work!

Doublevil commented 5 years ago

An update on this: I don't like having unnamed values in a JSON file, as I think it makes it harder to both understand and parse. I'm thinking of using your first schema, but having the "furigana" array contain only objects, which may or may not have a "ruby" but will always have an "rt".

A representative example would be:

{
  "text": "大人買い",
  "reading": "おとながい",
  "furigana": [
    {
      "ruby": "大人",
      "rt": "おとな"
    }, {
      "ruby": "買",
      "rt": "が"
    }, {
      "rt": "い"
    }
  ]
}

As you can see, the kana only parts would have only rt and no ruby.

@fasiha What do you think? Does that make sense to you? Should we rename stuff to be more coherent with this approach?

fasiha commented 5 years ago

That mostly looks good to me! The only thing that gives me pause is—in the HTML spec, the <rt> tag has meaning only inside the <ruby> tag, so I wonder if someone who is familiar with HTML will be a bit confused when they see {"rt": "..."} as meaning “plain non-Ruby text”?

If we wanted to prevent such confusion, and wanted to put kana-only segments inside an object, maybe a totally separate key, e.g., {"plain": "..."}, might be more appropriate?

Sorry, I don’t have great answers, but thank you for spending so much time and energy seeking the best solution!

fasiha commented 5 years ago

Or actually, putting kana-only text in {"ruby":"..."} might be fine too, since you can have ruby without any rts inside: <ruby>ruby</ruby> without <ruby>any<rt>rts inside</rt></ruby>.

That is, the best solution might be exactly your suggestion, but with the ruby key instead of the rt key for plain kana. It'd be the exact same workflow for clients of your project, but more in keeping with the HTML spec.

Doublevil commented 5 years ago

Yes, the confusion with rt was my main concern as well. I agree that having plain kana in the ruby field would be more appropriate. So let's go with that. Thanks for your help. 👍
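
A minimal sketch of the shape agreed on above, assuming every furigana element is an object that always has a ruby and has an rt only when a reading is attached. The type name Segment and the helpers joinText, joinReading, and toHtml are hypothetical, and the released files may differ in details not discussed in this thread.

```typescript
// Assumed shape: ruby is always present; rt is omitted for plain kana.
type Segment = { ruby: string; rt?: string };
type Entry = { text: string; reading: string; furigana: Segment[] };

const entry: Entry = {
  text: "大人買い",
  reading: "おとながい",
  furigana: [
    { ruby: "大人", rt: "おとな" },
    { ruby: "買", rt: "が" },
    { ruby: "い" }, // plain kana: no rt
  ],
};

// Concatenating every ruby should reproduce the written form.
function joinText(e: Entry): string {
  return e.furigana.map((s) => s.ruby).join("");
}

// Concatenating rt, falling back to ruby for plain kana, should reproduce the reading.
function joinReading(e: Entry): string {
  return e.furigana.map((s) => (s.rt !== undefined ? s.rt : s.ruby)).join("");
}

// Segments with an rt become <ruby> elements; plain kana is emitted as-is.
// Assumption: no HTML escaping is needed for the input text.
function toHtml(e: Entry): string {
  return e.furigana
    .map((s) => (s.rt !== undefined ? `<ruby>${s.ruby}<rt>${s.rt}</rt></ruby>` : s.ruby))
    .join("");
}

console.log(joinText(entry) === entry.text);       // consistency check on the written form
console.log(joinReading(entry) === entry.reading); // consistency check on the reading
console.log(toHtml(entry));
```

Because every element is an object with the same required key, a client can map over the array without checking whether each item is a string or an object.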

Doublevil commented 5 years ago

And it's done. The resulting (gzipped) json files are available in the latest release. Of course don't hesitate to open a new issue if you notice something wrong. Thanks again.

BlueRaja commented 5 years ago

Thank you for continuing to support this awesome project. I and other silent observers appreciate it!