LaurensWeyn / Spark-Reader

A tool to assist non-native speakers in reading Japanese
GNU General Public License v3.0

Question: preferred definitions, should they be based on dictionary (deconjugated) form or surface (as it is in text) form? #15

Open wareya opened 7 years ago

wareya commented 7 years ago

Right now they're based on the dictionary form. This makes it a lot faster to set up preferred definitions since you don't have to set it on every conjugation of a given verb you run into, but there are situations where multiple preferred words can end up showing up for the same surface form because of deconjugation. Deconjugation is the only place that the surface form and dictionary form differ, so it seems like an inherent problem to me.
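A minimal sketch of the conflict described above, assuming preferences are stored in a map keyed by dictionary form. The class name, method names, and integer definition IDs are all hypothetical and are not taken from the Spark Reader source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Spark Reader source.
class DictionaryFormPreferences {
    // dictionary form -> ID of its preferred definition (IDs are made up)
    private final Map<String, Integer> preferredByDictionaryForm = new HashMap<>();

    void prefer(String dictionaryForm, int definitionId) {
        preferredByDictionaryForm.put(dictionaryForm, definitionId);
    }

    // Collects every preferred definition reachable from one surface form,
    // given the dictionary forms the deconjugator produced for it.
    List<Integer> preferredFor(List<String> deconjugatedForms) {
        List<Integer> hits = new ArrayList<>();
        for (String form : deconjugatedForms) {
            Integer id = preferredByDictionaryForm.get(form);
            if (id != null) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

For example, いった is the past form of both いく and いう. If each of those dictionary forms has its own preferred definition, `preferredFor` returns two IDs for the single surface form いった, and nothing in the stored data says which one should come first.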

Maybe spark reader could get away with marking definitions as "good" instead of "preferred", so you would have multiple preferred definitions in cases like this, and it wouldn't care about the specific word the definition pops up on. This would also let users basically pull all the definitions they like for a given word to the top of the list, and if you added a "not good" thing too, they could also push definitions they don't like to the bottom. And maybe, just maybe, you could also have actual preferences for specific surface forms, on top of the "good definition / bad definition" thing, which would care about the specific word again.
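The "good" / "not good" idea could be sketched roughly as a stable sort over a per-definition rating. This assumes definitions are identified by integer IDs and that unrated definitions count as neutral; the class and enum names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: ratings attach to a definition, not to a word.
class DefinitionRatings {
    // Declaration order doubles as sort order: GOOD first, BAD last.
    enum Rating { GOOD, NEUTRAL, BAD }

    private final Map<Integer, Rating> ratings = new HashMap<>();

    void rate(int definitionId, Rating rating) {
        ratings.put(definitionId, rating);
    }

    // Returns a reordered copy. List.sort is stable, so definitions with
    // the same rating keep the order the dictionary lookup gave them.
    List<Integer> sort(List<Integer> definitionIds) {
        List<Integer> sorted = new ArrayList<>(definitionIds);
        sorted.sort(Comparator.comparing(
                (Integer id) -> ratings.getOrDefault(id, Rating.NEUTRAL)));
        return sorted;
    }
}
```

Because the rating is keyed on the definition alone, it applies wherever that definition shows up, regardless of which surface form or dictionary form led to it.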

The issue here is that doing what I described in the above paragraph would make it very complicated to use spark reader effectively, which is why I'm asking about this issue as a question instead of a request.

This is only tangential/unrelated to the blacklist, which is to prevent parsing mistakes like はそう being a single segment, and seems to work well. But the blacklist basically has to be based on the surface form only because that's what it's for, and it feels gross for the blacklist and preferred definitions to use a different version of the same word to determine whether it counts.

For what it's worth, I'm already using my branch with preferred definitions based on the surface form instead of the dictionary form, and it seems to work well; it's just tedious. Being able to mark specific definitions as "good" without caring about the word itself would make it less tedious, but more complicated.

One interesting note is that basing preferred definitions on surface form instead of dictionary form makes them more robust to changes in the deconjugator, since you wouldn't run into one rare issue: if you mark two definitions as preferred for two different dictionary forms, and the deconjugator then changes so that a single surface form deconjugates to both dictionary forms, it's basically arbitrary which definition shows up first in the list.

This is basically a special case of the "multiple preferred definitions for the same conjugated word" issue, except that it means the behavior of the program changes under the user's feet instead of behaving differently for a word that they didn't set a preferred definition for yet. And if you set a specific one of those definitions as preferred, what happens to the other one? It was valid for its dictionary form, just for a different word. Is it no longer preferred?

LaurensWeyn commented 7 years ago

Perhaps some combination of both is possible?

When the user sets a definition as preferred, both the dictionary form and the conjugated form of the preferred word could be stored. If it shows up conjugated the same way, then clearly they want the definition they chose the last time it showed up. If a word shows up in a different form though, using the preferred definition is at least a good guess as to the best definition.

But if this guess is wrong and the user decides a new preferred definition, we'll need one more piece of information. Will this be the definition for this form only, or the global default?
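A rough sketch of that combination, assuming two lookup tables and a scope flag that the UI would supply when the user overrides a guess. All names here are hypothetical, and the real file format is not specified in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical sketch of storing both forms with a fallback lookup.
class LayeredPreferences {
    enum Scope { THIS_FORM_ONLY, GLOBAL_DEFAULT }

    private final Map<String, Integer> bySurfaceForm = new HashMap<>();
    private final Map<String, Integer> byDictionaryForm = new HashMap<>();

    void prefer(String surfaceForm, String dictionaryForm, int definitionId, Scope scope) {
        bySurfaceForm.put(surfaceForm, definitionId);
        // The first choice for a word also becomes its default; after that,
        // the default only changes when the user asks for a global change.
        if (scope == Scope.GLOBAL_DEFAULT || !byDictionaryForm.containsKey(dictionaryForm)) {
            byDictionaryForm.put(dictionaryForm, definitionId);
        }
    }

    OptionalInt lookup(String surfaceForm, String dictionaryForm) {
        // Exact conjugation seen before: use what the user chose last time.
        Integer exact = bySurfaceForm.get(surfaceForm);
        if (exact != null) {
            return OptionalInt.of(exact);
        }
        // Otherwise fall back to the word-level default as a best guess.
        Integer guess = byDictionaryForm.get(dictionaryForm);
        return guess != null ? OptionalInt.of(guess) : OptionalInt.empty();
    }
}
```

This sketch keys the fallback on a single dictionary form per lookup, so it does not by itself resolve the case where one surface form deconjugates to several dictionary forms.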

I haven't thought this through thoroughly, but it seems like a good compromise between accuracy and convenience. It will require a new file format of course, but since preferred definitions are broken due to the hashing problem anyway this isn't a big deal.

wareya commented 7 years ago

Maybe splitting it into two different menu items would be a good idea after all. That way the user doesn't have to worry about what spark reader is doing under the hood.

Even though having lots of menu options is generally a bad thing, being able to say "this is generally a good definition" would reduce a lot of the tedium of teaching spark reader how to sort definitions, and still allow exceptional definitions to be based on the exact spelling (surface form) instead of the deconjugated form, without worrying about which order you encounter them in or how spark reader decides what's what.

Since the file format is changing anyway, I could see that happening: one file for good definitions, one for preferred exceptions.

LaurensWeyn commented 7 years ago

I was thinking more along the lines of having a popup message appear if the definition conflicts, so the user only needs to worry about the details of how things will be preferred once it's relevant. That, or perhaps only show the other options in the right click menu if a definition already conflicts. It sort of already does this for some things, e.g. lookup text will only show up in the right click menu when right clicking a line with Japanese in it (though I realize it's probably better UI practice to grey out the option instead of hiding it).

I'm no good at UI design, as is quite clear by most of the feedback I'm getting from Reddit PMs being about the UI, so I may not be the best person to ask on the nicest way to show this to the user, but it's the neatest I can come up with.

wareya commented 7 years ago

There's also the issue where sometimes a single word has multiple valid definitions but they're way down in the list. This happens a lot with normal words spelled in kana. Someone who just wants to bring those definitions to the top of the list would probably look for a way to tell spark reader "this definition is good" instead of "set this definition as the default".

The simple way to handle that would be to make it so that setting a definition as the default would bring it to the top of the list directly (as far as user-facing behavior is concerned), but that idea doesn't work well when you have definitions matching different spellings or when you have deconjugation. I guess this is why I think having some kind of "this is a good definition" button is a good idea, in addition to a "I want this definition to be at the top of the list for this specific word" button.

Right now, the preferred definition thing does double duty as a way of marking exceptions (e.g. おっと is "oops", not 夫) and as a way of marking good definitions (良い and 言い are better interpretations of いい than 易々). You could give it some internal logic so that it's better at doing both, but I think it might be better UX if they're presented as different ideas by having different buttons. If you want, I could test this out in my kuromoji branch, since I already made the existing preferred definitions thing work this way.

LaurensWeyn commented 7 years ago

Good point...

I suppose a 'good definition' in general is another level of preferable definitions. So the 3 levels are, as far as I understand:

  1. The definition is 'good' for all of its spellings
  2. The definition is 'good' for a specific spelling
  3. The definition is 'good' for a specific conjugation of a specific spelling

As things are now, 'preferred definitions' store the second level. The first level on its own seems like it might conflict with kana words easily, and the third level would be too much work for the user. Having all 3 of those on the UI might be a bit much, perhaps either the first or second could be left out without causing too much loss in functionality.

I suppose you're suggesting to provide visible options for storing the first and third, and not deal with the second.

I'll need to think about how this would work best a bit more.

wareya commented 7 years ago

> I suppose you're suggesting to provide visible options for storing the first and third, and not deal with the second.

Yep, that's right.

This is a hard problem, so take your time.

LaurensWeyn commented 7 years ago

I'm still indecisive on this. Perhaps some sort of option for a more advanced mode could exist, but keeping that compatible with the basic mode...

While I'm at it with this change, the new preferred definition file format should also store preferred readings, since right now it just uses the first kana one it finds (which does work surprisingly well).

And on a related note, I plan on actually releasing 0.7 soon so I can work on some major UI changes for 0.8. With that release, I'm going to temporarily "break" the Edict ID parsing so it's compatible again, so users won't need to have their preferred definition list cleared twice in 2 consecutive versions.

If you have any deconjugator fixes, you may want to add them soon.

wareya commented 7 years ago

Sounds good. I'll look at what I have.

> Perhaps some sort of option for a more advanced mode could exist, but keeping that compatible with the basic mode...

The basic mode could do good+preferred, with only one preferred definition per exact spelling. It could show a "half selected" icon (as opposed to unselected or fully selected, e.g. blank -> box -> checkmark) when a definition is good but not preferred, and clicking it could only set it to "blank" (neither good nor preferred) or "checkmark" (both good and preferred), with other good definitions being set to the halfway state. I think this is similar in spirit to what you thought of before, though I don't remember exactly what you thought of and don't have the state of mind to reread the conversation.
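A minimal sketch of that three-state icon, assuming "good" is stored per definition and "preferred" is stored once per exact spelling. The names are hypothetical, and it also assumes that clicking a blank or box entry promotes it to checkmark, which the description above leaves open.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the blank / box / checkmark states.
class BasicModeMarks {
    enum Mark { BLANK, BOX, CHECKMARK }

    private final Set<Integer> good = new HashSet<>();
    private final Map<String, Integer> preferredBySpelling = new HashMap<>();

    Mark markOf(String spelling, int definitionId) {
        if (Integer.valueOf(definitionId).equals(preferredBySpelling.get(spelling))) {
            return Mark.CHECKMARK; // good and preferred for this spelling
        }
        return good.contains(definitionId) ? Mark.BOX : Mark.BLANK;
    }

    void click(String spelling, int definitionId) {
        if (markOf(spelling, definitionId) == Mark.CHECKMARK) {
            // checkmark -> blank: neither good nor preferred any more
            preferredBySpelling.remove(spelling);
            good.remove(definitionId);
        } else {
            // blank or box -> checkmark; a previously preferred definition
            // stays in the good set, so it falls back to the box state
            good.add(definitionId);
            preferredBySpelling.put(spelling, definitionId);
        }
    }
}
```

With いい, clicking the 良い definition and then the 言い definition would leave 言い as the checkmark and 良い as a box, so both stay near the top while only one is the default for that spelling.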

wareya commented 6 years ago

I've been thinking about this, and I think I've changed my opinion. Preferred definitions should be based on whatever links them to the definition, which is the deconjugated form. And I don't think SR needs a "this is a good definition" option, because frequency information is a thing now and there's a per-spelling definition blacklist for totally insane relationships (は in kana -> 齒).

A bigger concern to me now is how known-word marking works, which I think should be based on definition instead of spelling. Maybe with an option to use a different color for words that have alternate definitions you don't know but that default to a definition you do know. After I started mining with the definition export feature, I usually try to mark definitions as known instead of words.
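Definition-based known marking could look something like the sketch below. It assumes each word carries an ordered list of definition IDs with the default one first; the class, enum, and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: "known" is tracked per definition, not per spelling.
class KnownDefinitions {
    enum Status { UNKNOWN, KNOWN, KNOWN_WITH_UNKNOWN_ALTERNATES }

    private final Set<Integer> known = new HashSet<>();

    void markKnown(int definitionId) {
        known.add(definitionId);
    }

    // definitionIds: every definition matching the word, default first.
    Status statusOf(List<Integer> definitionIds) {
        if (definitionIds.isEmpty() || !known.contains(definitionIds.get(0))) {
            return Status.UNKNOWN;
        }
        // The default is known; check whether any alternate is not.
        return known.containsAll(definitionIds)
                ? Status.KNOWN
                : Status.KNOWN_WITH_UNKNOWN_ALTERNATES;
    }
}
```

The third status is what the optional different color would be drawn for.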

By the way, there should be an option for the definition export feature to prefer exporting kanji. I'll post this on the JMDict issue.