FooSoft / yomichan

Japanese pop-up dictionary extension for Chrome and Firefox.
https://foosoft.net/projects/yomichan

Yomichan shouldn't prioritize exact match over frequency. #1669

Open epistularum opened 3 years ago

epistularum commented 3 years ago

1) The frequency of de-inflected verbs/adjectives should be properly taken into account. When looking up 歩き, chances are you actually want to see the definition of 歩く first: the frequency of 歩く is far higher than that of 歩き, and that order should be respected when entries are displayed in Yomichan.

2) It gets quite difficult (or nearly impossible) to find the deconjugated match under multiple "exact matches", such as names. In my case, I have to go through 29 entries to finally find 落ちる when looking up おち.

Here are some examples. I've excluded the more extreme ones, which would produce ridiculously long images: [screenshots not preserved]

toasted-nutbread commented 3 years ago

You seem to be having two different issues here:

  1. Exact matches appearing before deinflected matches (you would see the same thing if the score was identical for 歩き and 歩く).
  2. Names appearing before "more meaningful" definitions.

I would argue that 1 is the correct behaviour, because how do we know the user doesn't want to see 歩き instead of 歩く? 歩き has the additional noun meaning which could be correct for the context. Compare vs Jisho, which also doesn't list 歩く at the top. And while maybe this is a contrived example, a learner should also be able to intuit that 歩き is a form of 歩く from both the raw text and the definition.

2 is probably the same issue as #105, and you can improve this by decreasing the priority of the names dictionary.

Thermospore commented 3 years ago

Yeah I just moved jmnedict to a separate profile so I didn't have to flip through stacks of names when looking for a word

ttu-ttu commented 3 years ago

I was thinking we could provide an option in the settings to prioritize the deinflected form over the inflected one. I think that makes sense because J-J dictionaries will, 90% of the time, ask us to refer to the base (deinflected) form.

Another way to deal with this is to place the deinflected form right below the exact match, also controlled by a setting of course, since I believe it's more of a user preference.
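To make that second idea concrete, here is a minimal sketch. It assumes each result already carries a flag saying whether it was reached through deinflection; the `deinflected` field, the `placeDeinflectedBelowExact` function, and the `enabled` setting are all hypothetical names, not part of Yomichan.

```javascript
// Hypothetical sketch: move the first deinflected entry directly below the
// first exact match. `deinflected` and `enabled` are assumed names.
function placeDeinflectedBelowExact(entries, enabled) {
    const result = entries.slice();
    if (!enabled) { return result; }

    const firstExact = result.findIndex((entry) => !entry.deinflected);
    const firstDeinflected = result.findIndex((entry) => entry.deinflected);

    // Nothing to do if either kind is missing, or if the deinflected entry
    // is already listed above the first exact match.
    if (firstExact < 0 || firstDeinflected < firstExact) { return result; }

    // Removing an item that sits after firstExact does not shift firstExact.
    const [moved] = result.splice(firstDeinflected, 1);
    result.splice(firstExact + 1, 0, moved);
    return result;
}

// Placeholder data for a lookup of おち; the name entries are stand-ins.
const lookupResults = [
    {term: 'おち', deinflected: false},
    {term: '(name 1)', deinflected: false},
    {term: '(name 2)', deinflected: false},
    {term: '落ちる', deinflected: true},
];

const reordered = placeDeinflectedBelowExact(lookupResults, true);
console.log(reordered.map((entry) => entry.term));
```

With the setting off, the list is returned in its original order, so the current behaviour stays the default.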

epistularum commented 3 years ago

> I would argue that 1 is the correct behaviour, because how do we know the user doesn't want to see 歩き instead of 歩く?

I believe this should be handled by the freq information. For instance, 歩き has a freq of 2 while 歩く has a freq of 601. This freq information is taken from the provided JMdict dictionary. In most instances I believe it makes more sense to show the de-inflected form, but it is true that sometimes the conjugated form is far more frequent than the unconjugated one, e.g. 物思い vs 物思う. That is why I think we should rely on the freq indicator since it can differentiate between the two. Having a toggle like ttu-ttu explained is another idea worth looking into, but it is not as granular as what I explained above.
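As a rough illustration of what I mean, here is a sketch that sorts by freq first and only uses the exact-match distinction as a tiebreaker. The entry shape (`freq`, `deinflectionSteps`) and the function name are made up for the example; only the two freq values come from what I see in Yomichan.

```javascript
// Hypothetical entry shape. `freq` is the value shown for each term;
// `deinflectionSteps` is 0 for an exact match of the looked-up text.
const arukiEntries = [
    {term: '歩き', freq: 2, deinflectionSteps: 0},
    {term: '歩く', freq: 601, deinflectionSteps: 1},
];

// Higher freq first; on a tie, prefer the entry closer to the source text.
function compareByFreq(a, b) {
    if (a.freq !== b.freq) { return b.freq - a.freq; }
    return a.deinflectionSteps - b.deinflectionSteps;
}

const byFreq = [...arukiEntries].sort(compareByFreq);
console.log(byFreq.map((entry) => entry.term)); // 歩く is listed before 歩き
```

Because the comparison is per entry, a case where the conjugated form has the higher freq would keep the conjugated form on top without any extra setting.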

On another note, where does this freq info come from? I can't seem to find it in the jmdict file itself.

> 2 is probably the same issue as #105, and you can improve this by decreasing the priority of the names dictionary.

I already have my name dictionary on the lowest priority compared to my other dicts. That is why I believe yomichan displays direct matches higher than deconjugated matches. In this example, all the names are considered direct matches since the looked-up text is written phonetically (in kana), while 食べる needs to be de-conjugated and would be considered an indirect match. At least, that is my understanding of the behaviour.
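Here is a small model of that understanding, using the おち lookup from my first comment. This is not Yomichan's actual sorting code, just my guess at the effect: the field names, priorities, and scores are all invented for the example, and the name entries are placeholders.

```javascript
// Assumed model: direct matches always win; dictionary priority and score
// are only consulted between entries with the same number of steps.
function compareExactFirst(a, b) {
    if (a.deinflectionSteps !== b.deinflectionSteps) {
        return a.deinflectionSteps - b.deinflectionSteps;
    }
    if (a.dictionaryPriority !== b.dictionaryPriority) {
        return b.dictionaryPriority - a.dictionaryPriority;
    }
    return b.score - a.score;
}

// Looking up おち: the names match the kana directly, the verb does not.
const ochiEntries = [
    {term: '落ちる', deinflectionSteps: 1, dictionaryPriority: 10, score: 100},
    {term: '(name 1)', deinflectionSteps: 0, dictionaryPriority: 0, score: 0},
    {term: '(name 2)', deinflectionSteps: 0, dictionaryPriority: 0, score: 0},
    {term: '(name 3)', deinflectionSteps: 0, dictionaryPriority: 0, score: 0},
];

const exactFirst = [...ochiEntries].sort(compareExactFirst);
console.log(exactFirst.map((entry) => entry.term)); // 落ちる ends up last
```

If this model is right, lowering the priority of the names dictionary cannot help here, because priority is never compared between a direct match and a deconjugated one.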

toasted-nutbread commented 3 years ago

> Another way to deal with this is to place the deinflected form right below the exact match

This information isn't stored in the dictionaries that Yomichan imports, and I'm not sure it would be safe in the general case to assume what is and isn't an inflection.

> That is why I think we should rely on the freq indicator since it can differentiate between the two.

To clarify: by "freq" do you mean the score for a definition, the green frequency tags, or something else?

> On another note, where does this freq info come from? I can't seem to find it in the jmdict file itself.

https://github.com/FooSoft/yomichan-import/blob/83e3e44f46e344bfe66d9c7181caa5b113f8fb2a/edict.go#L160
https://github.com/FooSoft/yomichan-import/blob/83e3e44f46e344bfe66d9c7181caa5b113f8fb2a/edict.go#L48-L65

> I already have my name dictionary on the lowest priority compared to my other dicts. That is why I believe yomichan displays direct matches higher than deconjugated matches.

Yeah, I see what you mean now; this issue affects kana-only searches more so than kanji definitions. There is also some discussion in #1539 about updating how dictionary priority is handled internally, and this may fall into that category as well.


For reference, this is the current code for sorting dictionary entries:

https://github.com/FooSoft/yomichan/blob/e7d349c3ec75f61bab09035226921968f7423741/ext/js/language/translator.js#L1186-L1228