LaurensWeyn / Spark-Reader

A tool to assist non-native speakers in reading Japanese
GNU General Public License v3.0

Text in brackets being removed from Edict definitions #20

Closed LaurensWeyn closed 7 years ago

LaurensWeyn commented 7 years ago

I imagine this has to do with the new tag parsing system. I noticed it mainly because all my custom definitions for names end in (first name) or (surname), which no longer display. And tags aside, EDICT uses brackets for a lot of non-tag things too: entries are numbered when more than one sense exists, and extra comments or info on special cases are often included as well.

This is sort of the result of the annoying way EDICT2 is formatted... one of the reasons I plan to switch over to the more detailed XML version at some point. Not a big issue since the parser will be replaced anyway, but just pointing it out.
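To illustrate what the parser is up against, here's a minimal sketch; the line below is representative of EDICT2's format, not quoted verbatim from the file:

```java
class Edict2BracketDemo {
    public static void main(String[] args) {
        // A representative EDICT2-style line. Round brackets serve at least
        // three distinct roles that look identical to a naive parser:
        //   (v5u,vi)    part-of-speech tags
        //   (1), (2)    sense numbering
        //   (esp. 遭う)  a note tied to one particular spelling
        //   (P)         a priority marker
        String line = "会う(P);遭う [あう] /(v5u,vi) (1) to meet/(2) (esp. 遭う) to have an accident/(P)/";

        // Stripping every bracketed group, which is roughly what the tag parser
        // ends up doing, also deletes the numbering, the notes, and custom
        // suffixes like (surname):
        System.out.println(line.replaceAll("\\([^)]*\\)", ""));
    }
}
```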

Also, saving the user dictionary to disk works, but the new entries aren't added to the in-memory data structure. This could probably be fixed easily by removing all entries from the 'custom' source and then adding them all back when save changes is clicked. And since Swing isn't thread safe, it doesn't like the definition table being updated by another window; then again, this UI may be separate from the Settings screen and perhaps redone for things like VNDB name import later.
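Roughly like this, as a minimal sketch; the map, table, and method names here are placeholders rather than the actual classes:

```java
import javax.swing.JTable;
import javax.swing.SwingUtilities;
import java.util.List;
import java.util.Map;

class CustomDictSync {
    // defsBySource maps a source name ("edict", "custom", ...) to its entries;
    // a stand-in for whatever structure the dictionary actually uses.
    static void onSaveChanges(Map<String, List<String>> defsBySource,
                              List<String> justSaved, JTable definitionTable) {
        // Throw away the stale 'custom' entries and re-add everything that was
        // just written to disk, so memory matches the user dictionary file.
        defsBySource.put("custom", List.copyOf(justSaved));

        // Swing isn't thread safe: only touch the table on the Event Dispatch
        // Thread. If the save was triggered from another window or thread,
        // hop over to the EDT for the refresh.
        SwingUtilities.invokeLater(definitionTable::repaint);
    }
}
```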

wareya commented 7 years ago

I think some word had a blank definition when I tried to look it up because of this. Yeah, it's annoying: you can't tell whether something in brackets is a tag, a restriction to a particular spelling/reading, or just a note without having the code check each possibility in turn, and I never got around to doing that because the code was already getting messy.
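Something in the spirit of checking one possibility at a time, as a rough sketch; the tag set here is a tiny sample, the real EDICT list is much longer:

```java
import java.util.Set;

class BracketClassifier {
    // Tiny sample of EDICT's tag vocabulary, for illustration only.
    static final Set<String> KNOWN_TAGS =
            Set.of("n", "v5u", "v1", "vi", "vt", "adj-i", "exp", "uk", "P");

    enum Kind { TAG, SENSE_NUMBER, RESTRICTION_OR_NOTE }

    static Kind classify(String inner) {
        // 1. A comma-separated list where every piece is a known tag -> tags.
        boolean allTags = true;
        for (String part : inner.split(",")) {
            if (!KNOWN_TAGS.contains(part.trim())) { allTags = false; break; }
        }
        if (allTags) return Kind.TAG;

        // 2. Pure digits -> sense numbering like (1), (2).
        if (inner.matches("\\d+")) return Kind.SENSE_NUMBER;

        // 3. Anything else is a spelling/reading restriction or a free-form
        //    note, and EDICT2 gives no reliable way to tell those apart.
        return Kind.RESTRICTION_OR_NOTE;
    }
}
```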

LaurensWeyn commented 7 years ago

Working on the JMDict conversion, I'm surprised I didn't notice the missing brackets sooner; there's a lot of useful information there.

This also broke line splitting in the definition export, and strips all bracketed text from the user dictionary file after saving. Rather than try to fix it, I decided to get going on the JMDict parser instead.

I've made a new branch for this since it changes a lot of stuff (is it a good idea to upload nearly 100MB files to GitHub? Probably not...). It loads and stores all the metadata properly this time, though not all of it is in use yet. I still have to port things like your FrequencySink, though.
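For reference, streaming the XML keeps memory sane despite the file size. A minimal sketch; keb, reb, and gloss are real JMdict element names, but the rest of the handling is simplified:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class JMDictSketch {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // JMdict declares internal DTD entities for its tags, so entity
        // replacement must stay enabled (it is by default).
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream("JMdict_e"));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                switch (reader.getLocalName()) {
                    case "keb":   // one kanji spelling
                    case "reb":   // one kana reading
                    case "gloss": // one English definition
                        System.out.println(reader.getLocalName() + ": "
                                + reader.getElementText());
                        break;
                }
            }
        }
        reader.close();
    }
}
```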

wareya commented 7 years ago

Oh wow, I had no idea it broke so much stuff. Sorry about that.

The FrequencySink stuff just tries to find a valid combination of spelling and reading (in katakana) in the frequency data. It's a simple idea, but the code is gross. I can fix it up once you decide the JMDict functionality is ready.
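The gist of it, as a sketch rather than the real code; the tab-joined key format is just an assumption about how the frequency data could be keyed:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencyLookup {
    // e.g. "会う\tアウ" -> rank; the key format here is an assumption.
    final Map<String, Integer> freqByKey = new HashMap<>();

    Integer find(List<String> spellings, List<String> katakanaReadings) {
        // Try every spelling/reading pair the word has until one of them
        // actually appears in the frequency table.
        for (String spelling : spellings) {
            for (String reading : katakanaReadings) {
                Integer rank = freqByKey.get(spelling + "\t" + reading);
                if (rank != null) return rank; // first valid combination wins
            }
        }
        return null; // no combination found in the frequency data
    }
}
```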

LaurensWeyn commented 7 years ago

Don't worry, it's not too big a deal. It's mostly my fault for being too lazy to set up tests for anything except the simplest of things.

wareya commented 7 years ago

Regression testing is hard.

wareya commented 7 years ago

I think the definition export feature should have an option to prefer exporting kanji. This was one of the ideas behind associating individual spellings and readings with definitions. It becomes possible once the JMDict functionality is all done, since you could then get the most preferred (i.e. first listed) kanji valid for a given reading when the word was written in kana. With the old EDICT functionality there's no way to make sure the exporter only looks at kanji that are valid for that reading, since parsing the brackets turned out to break things.
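As a rough sketch of that lookup, using JMdict's re_restr restrictions (an empty restriction list means the reading applies to every kanji form):

```java
import java.util.List;

class KanjiPicker {
    // A reading plus the kanji spellings it is restricted to (via re_restr).
    record Reading(String kana, List<String> restrictedTo) {}

    // kanjiForms are in JMdict order, most preferred first.
    static String preferredKanji(List<String> kanjiForms, Reading reading) {
        for (String kanji : kanjiForms) {
            if (reading.restrictedTo().isEmpty()
                    || reading.restrictedTo().contains(kanji)) {
                return kanji; // first kanji valid for this reading
            }
        }
        return reading.kana(); // no valid kanji: fall back to the kana form
    }
}
```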

LaurensWeyn commented 7 years ago

I think the JMDict implementation is fairly stable now and ready for the master branch. The biggest issue with it right now is the relevance/sorting system, which needs some tweaking. I could have emulated the EDICT2 'P' tag approach, but I went for a new scoring system that should be better in the long run.
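To give an idea of the direction, here's a minimal sketch of that kind of scoring; the weights are made up and not the actual values, but the tags (news1, ichi1, nfXX, etc.) are JMdict's real priority markers:

```java
import java.util.List;

class RelevanceScore {
    static int score(List<String> priorityTags) {
        // Sum weights across priority tags instead of relying on a single
        // boolean 'P' flag, so entries can be ranked more finely.
        int score = 0;
        for (String tag : priorityTags) {
            switch (tag) {
                case "news1", "ichi1", "spec1", "gai1" -> score += 100;
                case "news2", "ichi2", "spec2", "gai2" -> score += 50;
                default -> {
                    // nfXX buckets: nf01 covers the most frequent ~500 words,
                    // so lower bucket numbers should contribute more.
                    if (tag.startsWith("nf")) {
                        score += Math.max(0, 50 - Integer.parseInt(tag.substring(2)));
                    }
                }
            }
        }
        return score;
    }
}
```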

With that, this bug is mostly fixed, aside from my frustrations with the user dictionary editor not updating internally, not saving to a file, or both. On the bright side, this has given me the motivation to get going on that VNDB importer, which I hope to start this Wednesday or sooner.