LaurensWeyn / Spark-Reader

A tool to assist non-naitive speakers in reading Japanese
GNU General Public License v3.0
31 stars 7 forks source link

Edict entry ID is parsed incorrectly #13

Closed wareya closed 7 years ago

wareya commented 7 years ago

This is a mild issue since it basically breaks compatibility with any files that store edict IDs directly. If this is going to get fixed it should be fixed ASAP before spark reader becomes too popular.

According to the format page: http://www.edrdg.org/jmdict/edict_doc.html#IREF02

The field has the format: EntLnnnnnnnnX. The EntL is a unique string to help identify the field. The "X", if present, indicates that an audio clip of the entry reading is available from the JapanesePod101.com site.

The current code is something like:

String IDCode = bits[bits.length - 1].replaceFirst("Ent", "")

It should be something like:

String IDCode = bits[bits.length - 1].replaceFirst("EntL", "").replaceFirst("X", "");

Ran into this because I decided to print the ID in the definition popdown and noticed massive negative values, from hashing failed conversions. I did that because I want to make a blacklist for certain surface forms (spellings) to not be allowed to be interpreted as certain Edict entries (e.g. はそう as 半挿).

An alternative is to store the existing interpretation of the ID (where it falls back to a hash) in existing files so that compatibility isn't broken, but to use the real edict id for new stuff like the blacklist I want to make. I wouldn't really like this, but it's a reasonable compromise. You could also detect if preferredDefs has bogus IDs in it and automatically change them to the right version by finding which definition for that word has the same id-string-sans-Ent hash, but that's complicated.

LaurensWeyn commented 7 years ago

Whoops.

I'm surprised it's gone this long without me or someone else noticing it. I guess since this is still a beta it would be fine to change it still. This only affects preferredDefs externally at the moment, which isn't too big a deal.

I'll fix it and add an incompatibility warning.