Doublevil / JmdictFurigana

A Japanese dictionary resource that attaches furigana to individual words

Include JMdict sequence numbers? #17

Closed fasiha closed 3 years ago

fasiha commented 3 years ago

I'm not sure if you saw this, but I mentioned JmdictFurigana on the EDICT-JMdict mailing list, in the context of assigning furigana to entries, and Jim Breen asked:

I looked at the JSON version of the JMdict-with-furigana, sort-of expecting that the JMdict sequence numbers would be included but they're not. Is there any way they can be added? It's a challenge to align it with JMdict without them - is 川柳 せんりゅう or かわやぎ?

(It looks like you'll need a Google account and access to the group to see the thread online?)

I'm not sure how difficult this might be, so I thought to make an issue at least to track the request. The author of https://github.com/Doublevil/JmdictFurigana/pull/16 also chimed in on that thread so you might have some help 😁. Thanks as always!

Doublevil commented 3 years ago

Of course they could be added, but I'm not sure why they would be needed in the first place. JSON entries have both a "text" field (the writing with kanji) and a "reading" field (the writing in hiragana).

So to use the same example, if you have this entry:

{
  "text": "川柳",
  "reading": "せんりゅう",
  "furigana": [
    (...)
  ]
}

You know this entry is for the せんりゅう reading of 川柳.
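
As a minimal sketch of that lookup, assuming the JSON file has already been parsed into a list of entries shaped like the one above: the helper name `index_by_text_and_reading` is made up for illustration, the かわやぎ entry is included only because it is the alternative reading mentioned in the request, and the furigana segments are left empty rather than guessed.

```python
def index_by_text_and_reading(entries):
    """Map each (text, reading) pair to its entry."""
    return {(e["text"], e["reading"]): e for e in entries}


# Sample data for illustration; furigana segments are deliberately left empty.
entries = [
    {"text": "川柳", "reading": "せんりゅう", "furigana": []},
    {"text": "川柳", "reading": "かわやぎ", "furigana": []},
]

index = index_by_text_and_reading(entries)

# The pair, not the text alone, selects the entry.
entry = index[("川柳", "せんりゅう")]
print(entry["reading"])
```

Both entries share the same "text", so it is the pair that works as the key here.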

Am I missing something?

Doublevil commented 3 years ago

No answer, so I'm closing the issue. Please re-open if there is any news on this.

CameronChambers93 commented 3 years ago

Hello, I forwarded your concern to the mailing list. Jim wrote this back in response:

My interest was to add the derived furigana as an option to the WWWJDIC server. To do this I'd need to align the dictionary entries with the JmdictFurigana data. The alignment would be simple if the sequence numbers were in both streams. Otherwise, I'd have to do a (daily) kanji/reading alignment as I use the latest dictionary version. Not impossible but messy when many readings have multiple kanji forms and readings
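
To illustrate the kanji/reading alignment described above, here is a rough sketch. It assumes a simplified, hypothetical JMdict entry with `seq`, `kanji`, and `readings` keys (not the real JMdict XML structure), a placeholder sequence number, and a furigana index keyed by (text, reading). It also ignores any restrictions JMdict places on which readings apply to which kanji forms.

```python
from itertools import product


def align_entry(jmdict_entry, furigana_index):
    """Collect furigana entries matching any kanji/reading pair of one JMdict entry."""
    matches = []
    # Every kanji form has to be tried against every reading.
    for kanji, reading in product(jmdict_entry["kanji"], jmdict_entry["readings"]):
        hit = furigana_index.get((kanji, reading))
        if hit is not None:
            matches.append(hit)
    return matches


# Illustrative furigana data; the furigana segments are left empty.
furigana_index = {
    ("川柳", "せんりゅう"): {"text": "川柳", "reading": "せんりゅう", "furigana": []},
    ("川柳", "かわやぎ"): {"text": "川柳", "reading": "かわやぎ", "furigana": []},
}

# Hypothetical, simplified JMdict entry; 1 is a placeholder, not a real sequence number.
jmdict_entry = {"seq": 1, "kanji": ["川柳"], "readings": ["せんりゅう"]}

matches = align_entry(jmdict_entry, furigana_index)
```

If both streams carried the sequence number, the loop over pairs could shrink to a single lookup on that number.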

As fasiha said, I would be happy to help if this is something you wanted to tackle.

MartinP7r commented 3 years ago

I think Jim's point relates to database query cost. In your structure, the primary key (so to speak) would be (text, reading), I guess, so matching a JMdict entry with the entry in your file means comparing both of those properties. However, the JMdict entry already has the reading. If this were a normalized database table, it would probably hold just the sequence number and the furigana: you would not need the reading in the furigana table and could query on the sequence number as the primary key.
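
A small sketch of the two layouts, using Python's built-in `sqlite3`. The table and column names are invented for illustration, the furigana is stored as an empty JSON string, and the sequence number is a placeholder. It also assumes one furigana row per sequence number, which entries with several kanji forms or readings would not satisfy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Layout 1: composite key, mirroring the "text" and "reading" JSON fields.
conn.execute(
    "CREATE TABLE furigana_by_pair ("
    " kanji_text TEXT NOT NULL, reading TEXT NOT NULL, furigana TEXT NOT NULL,"
    " PRIMARY KEY (kanji_text, reading))"
)
# Layout 2: keyed by sequence number only (assumes one row per ent_seq).
conn.execute(
    "CREATE TABLE furigana_by_seq ("
    " ent_seq INTEGER PRIMARY KEY, furigana TEXT NOT NULL)"
)

PLACEHOLDER_SEQ = 1  # not a real JMdict sequence number
conn.execute(
    "INSERT INTO furigana_by_pair VALUES (?, ?, ?)", ("川柳", "せんりゅう", "[]")
)
conn.execute("INSERT INTO furigana_by_seq VALUES (?, ?)", (PLACEHOLDER_SEQ, "[]"))

# Matching on two text columns versus matching on one integer key.
by_pair = conn.execute(
    "SELECT furigana FROM furigana_by_pair WHERE kanji_text = ? AND reading = ?",
    ("川柳", "せんりゅう"),
).fetchone()
by_seq = conn.execute(
    "SELECT furigana FROM furigana_by_seq WHERE ent_seq = ?", (PLACEHOLDER_SEQ,)
).fetchone()
```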

Obviously, if there are other use cases for this file besides pairing it up with JMdict, keeping the (text, reading) primary key makes sense.
In the end it really just depends on how the consuming system is going to use your file. If it's possible to export two different JSON versions (one with ent_seq and one with (text, reading)), that would probably be the most convenient solution for everybody.
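
One way the two exports might look, as a sketch only: it assumes each entry already carries an `ent_seq` field (which the current file does not have), the function names are made up, and the sequence number and empty furigana lists are placeholders.

```python
import json
from collections import defaultdict


def export_by_pair(entries):
    """Flat list identified by (text, reading), without sequence numbers."""
    return [
        {"text": e["text"], "reading": e["reading"], "furigana": e["furigana"]}
        for e in entries
    ]


def export_by_seq(entries):
    """Group entries under their sequence number; one number may cover several pairs."""
    grouped = defaultdict(list)
    for e in entries:
        grouped[e["ent_seq"]].append(
            {"text": e["text"], "reading": e["reading"], "furigana": e["furigana"]}
        )
    return dict(grouped)


# Placeholder data: 1 is not a real JMdict sequence number.
entries = [
    {"ent_seq": 1, "text": "川柳", "reading": "せんりゅう", "furigana": []},
]

pair_export = export_by_pair(entries)
seq_export = export_by_seq(entries)

# json.dumps turns the integer keys into strings in the output.
pair_json = json.dumps(pair_export, ensure_ascii=False)
seq_json = json.dumps(seq_export, ensure_ascii=False)
```

Grouping by sequence number keeps "text" and "reading" inside each group, since a single JMdict entry can have more than one kanji form or reading.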