birchill / 10ten-ja-reader

A browser extension to translate Japanese by hovering over words.
https://addons.mozilla.org/firefox/addon/10ten-ja-reader/
GNU General Public License v3.0

Deinflection should stop at Hiragana/Katakana boundaries #641

Open Tomalak opened 3 years ago

Tomalak commented 3 years ago

I've noticed that Rikaichamp (0.5.12 here) deinflects a bit too eagerly:

Twitter screenshot

捨ててロ will be deinflected as 捨ててろ, although the ロ is actually part of the following word, ローマ.

I think deinflection should generally stop at a Hiragana/Katakana boundary.
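
For illustration, a minimal sketch of that heuristic in TypeScript (the function name and shape are hypothetical, not taken from the extension's code): cut the text we try to deinflect at the first hiragana-to-katakana transition, which is the direction that trips up 捨ててローマ. It deliberately leaves the reverse direction (katakana followed by hiragana, as in サボっている below) alone.

```ts
// Hypothetical sketch of the proposed heuristic: truncate the candidate
// text at the first hiragana-to-katakana transition.
const isHiragana = (cp: number) => cp >= 0x3041 && cp <= 0x309f;
const isKatakana = (cp: number) => cp >= 0x30a1 && cp <= 0x30fa;

function cutAtScriptBoundary(text: string): string {
  for (let i = 1; i < text.length; i++) {
    const prev = text.codePointAt(i - 1)!;
    const curr = text.codePointAt(i)!;
    // 捨ててロ: て is hiragana and ロ is katakana, so stop before ロ.
    if (isHiragana(prev) && isKatakana(curr)) {
      return text.slice(0, i);
    }
  }
  return text;
}
```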

birtles commented 3 years ago

This is a bit tricky because we still want to support at least サボっている, for example.

For the reverse pattern (starting with hiragana and ending with katakana), apparently some writers use katakana for the okurigana, e.g. https://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q11159730741, which has an example with ためニ. But that's not related to deinflection, so maybe it's ok.

I see people sometimes putting ます as マス (e.g. https://ameblo.jp/cerise3ss/entry-12677838955.html), just like some people write it as 〼, but these are mostly just for fun and not necessarily something Rikai needs to understand.

I wonder if there's a better way to detect where to stop deinflection? It's also complicated a little by the fact that we normalize the source text to hiragana before we begin deinflection, so we'd need to preserve a bit more information to pass to the deinflection routine.
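
What "preserving a bit more information" could look like in practice, as a hedged sketch (the NormalizedInput shape is hypothetical): normalize to hiragana as today, but keep a per-character record of the original script so the deinflection routine can still see where the katakana was.

```ts
// Hypothetical sketch: normalize to hiragana while remembering which
// characters were originally katakana, so the deinflector can still
// detect script boundaries in the source text.
interface NormalizedInput {
  text: string;            // fully-hiragana text used for deinflection
  wasKatakana: boolean[];  // per-character record of the original script
}

function toHiraganaWithScriptInfo(input: string): NormalizedInput {
  let text = '';
  const wasKatakana: boolean[] = [];
  for (const ch of input) {
    const cp = ch.codePointAt(0)!;
    if (cp >= 0x30a1 && cp <= 0x30f6) {
      // Katakana → the corresponding hiragana is 0x60 code points lower.
      text += String.fromCodePoint(cp - 0x60);
      wasKatakana.push(true);
    } else {
      text += ch;
      wasKatakana.push(false);
    }
  }
  return { text, wasKatakana };
}
```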

SaltfishAmi commented 3 years ago

Another example

ふとリビングに小気味よい音が響いた。 This is another example of mis-deinflection that I encountered a few days ago. However, I do agree that it's tricky and we need to think twice before deciding where to stop deinflection.

Tomalak commented 3 years ago

> I see people sometimes putting ます as マス (e.g. https://ameblo.jp/cerise3ss/entry-12677838955.html), just like some people write it as 〼, but these are mostly just for fun and not necessarily something Rikai needs to understand.

Or, a lone Katakana マス could be a hard-coded exception, right along with 〼. I'm not sure that katakana switch even occurs with any other okurigana; this might be the only such usage.
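
If マス really is the only okurigana that shows up in katakana, the exception could be a tiny allowlist consulted before cutting at a script boundary. A hypothetical sketch (which would presumably live in the normalization step, as noted below):

```ts
// Hypothetical allowlist: katakana (or katakana-like) endings that should
// still be treated as okurigana rather than the start of a following word.
const OKURIGANA_EXCEPTIONS = ['マス', '〼'];

// Returns true if the text starting at `index` is an allowed exception,
// in which case the script-boundary cut should be skipped.
function isOkuriganaException(text: string, index: number): boolean {
  return OKURIGANA_EXCEPTIONS.some((ending) =>
    text.startsWith(ending, index)
  );
}
```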

> It's also a little bit complicated by the fact that we normalize the source text to hiragana before we begin deinflection

In that case it should be part of the normalization step, of course. :)

One could say:

or:

melink14 commented 3 years ago

One comment I would add is that it seems much, much worse to miss out on a deinflection that was intended than to get an extra match that was spurious. Don't forget that even without the katakana problem, it's not uncommon for longer conjugations/words to match incorrectly.

That being said, I wonder if it would be possible to deprioritize such probable mismatches...
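
Deprioritizing could be as simple as a sort key on the match list. A hypothetical sketch (the WordMatch shape is invented for illustration):

```ts
// Hypothetical sketch of deprioritizing instead of dropping: matches whose
// deinflection crossed a hiragana/katakana boundary sort after clean ones.
interface WordMatch {
  headword: string;
  crossedScriptBoundary: boolean;
}

function rankMatches(matches: WordMatch[]): WordMatch[] {
  return [...matches].sort(
    (a, b) => Number(a.crossedScriptBoundary) - Number(b.crossedScriptBoundary)
  );
}
```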

Tomalak commented 3 years ago

> it seems much, much worse to miss out on a deinflection that was intended than to get an extra match that was spurious

True, but there are a few things that are uncommon enough to rule them out. Okurigana switching from hiragana to katakana half-way through in regular (non-slang, non-joking) text is one of those things.

Instead of erring on the side of caution in these cases, it seems smarter to me to take the hint: after all, switching from hiragana to katakana is a common pattern for separating words from one another in written form.

SaltfishAmi commented 3 years ago
I am still strongly against the idea of completely eliminating those probable mismatches, but I think deprioritizing them is a great idea! Non-standard, uncommon text is something I encounter very frequently, after all.

Tomalak commented 3 years ago

Another example, from https://en.wikipedia.org/wiki/Honorific_speech_in_Japanese

image

where the いた actually belongs to the following word:

image

That's a slightly different issue (no Katakana) and it might be harder to recognize, but a general strategy for both cases could be a kind of "look-ahead" for the next word, as described above.

When pointing the cursor over the bold character, Rikaichamp recognizes:

So there is a specific point within the current match candidate where there is suddenly a longer match for the following word, which cuts the current match candidate short, just as with 捨ててローマ above. This point always falls on a kana character.

Since this point invariably occurs towards the end of the match candidate, it would make sense performance-wise to search for it from the end.
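
Sketching that out (lookupWord is a hypothetical stand-in for the real dictionary query, not the extension's API): walk backwards over the kana tail of the candidate and cut it wherever a longer match for the following word begins.

```ts
// Hypothetical sketch of the look-ahead: starting from the end of the
// current match candidate, check whether the following text forms a
// longer word that should "steal" the tail of the candidate.
declare function lookupWord(text: string): { matchLength: number } | null;

function trimOverreach(candidate: string, rest: string): string {
  // Walk backwards over trailing kana only (the break always falls on kana).
  for (let i = candidate.length - 1; i > 0; i--) {
    if (!/[\u3041-\u30fa]/.test(candidate[i])) break;
    const following = candidate.slice(i) + rest;
    const hit = lookupWord(following);
    // If the next word's match extends past the candidate's tail,
    // cut the candidate short at this point (as with 捨ててローマ).
    if (hit && hit.matchLength > candidate.length - i) {
      return candidate.slice(0, i);
    }
  }
  return candidate;
}
```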

birtles commented 3 years ago

Thanks! I can see what you mean about doing subsequent searches to detect if there is a natural break. However, even when searching from the end I'm concerned performance will be an issue.

I did some profiling of performance in Chrome recently (which, surprisingly, seems to be slower than Firefox for quite a few cases), and even just the additional name dictionary lookup we do in order to provide a name preview adds quite a bit to the total lookup time.

There's also a very strong relationship between the length of the input string and the total lookup time. I previously thought the IPC overhead, normalization, etc. would flatten it somewhat, but lookup time scales almost 1:1 with the input string length.

Which is all to say that doing extra lookups could be tricky unless we can really narrow it down.

birtles commented 3 years ago

Came across another one today: ダウンロードしていただきありがとうございます。

There's no hiragana/katakana break there so using the script as the heuristic isn't going to help with this particular case. @Tomalak's suggestion of searching from the end for longer matches would work here. Maybe there's some way we can really narrow down the scope of what we lookup.

For example, we could identify the deinflections that are likely to overreach (maybe it's all of them, other than the masu-stem one?), then determine the length we need to search back before we can be confident we're not overreaching, etc.
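
One possible shape for that narrowing (a hypothetical rule structure; the real deinflection table will differ): tag each rule with whether it can overreach, and derive a bound on how far back the confirming search needs to go.

```ts
// Hypothetical sketch: annotate each deinflection rule with whether it can
// overreach, so the extra lookups only cover the ending the rule consumed.
interface DeinflectRule {
  from: string;          // inflected ending, e.g. 'ていた'
  to: string;            // dictionary-form ending, e.g. 'る'
  canOverreach: boolean; // e.g. false for the masu-stem rule
}

// Upper bound on how many trailing characters need the second look.
function searchBackLength(rule: DeinflectRule): number {
  return rule.canOverreach ? rule.from.length : 0;
}
```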

Tomalak commented 3 years ago

Naively speaking, all inflections that also occur at the start of a headword (like the 〇〇ていた and いただく case) would fall into that category. It may well be that there is significant overlap, but the search space is limited, so to get a general idea it could be brute-forced.
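
As a sketch of that one-off brute force (data shapes are assumptions, e.g. readings coming from a JMdict dump): collect the inflected endings whose suffixes also start some headword reading.

```ts
// Hypothetical one-off script: find inflected endings whose tail is also
// the start of some headword reading (ていた vs いただく, てろ vs ローマ, ...).
function findOverlappingEndings(
  endings: string[],
  readings: string[]
): Set<string> {
  const overlapping = new Set<string>();
  for (const ending of endings) {
    // The following word may begin partway through the ending, so test
    // every suffix of the ending as a potential word start.
    for (let i = 0; i < ending.length; i++) {
      const suffix = ending.slice(i);
      if (readings.some((reading) => reading.startsWith(suffix))) {
        overlapping.add(ending);
        break;
      }
    }
  }
  return overlapping;
}
```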

Maybe there is also a way to improve lookup times in general, to give more headroom for fancier heuristics.

birtles commented 3 years ago

Currently we have a flat-file copy of the word database we load into memory when (a) the IndexedDB database is being updated, or (b) the IndexedDB database is not available for some reason. The lookup times for the in-memory database are much better but it comes at the cost of using up a lot of memory. It's also harder to update, slower to load (esp. when we are running as an event page), prone to load errors, and doesn't include non-English glosses.

We could make a slimmed-down version of the flat-file database that includes only the normalized headwords and parts of speech, and only for verbs, and load that into memory for refining deinflections. It's a bit of work, and it would generally lag behind the IndexedDB data, giving the wrong results sometimes, but it's probably sufficient for the sake of eliminating invalid deinflections. If we can fit that into less than 10 MB of memory it might be acceptable.
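
Structurally, that slimmed-down index could be little more than a map from normalized reading to parts of speech. A sketch under those assumptions (the JMdict-style POS tags here are illustrative):

```ts
// Hypothetical sketch of the slimmed-down in-memory index: normalized verb
// readings plus parts of speech, enough to reject a deinflection whose
// proposed dictionary form doesn't actually exist.
type PartOfSpeech = 'v1' | 'v5k' | 'v5r' | 'v5u' | 'vs';

// Populated once from the flat-file data when the extension starts up.
const verbIndex = new Map<string, PartOfSpeech[]>();

function isPlausibleDeinflection(
  dictionaryForm: string,
  requiredPos: PartOfSpeech
): boolean {
  const entry = verbIndex.get(dictionaryForm);
  return !!entry && entry.includes(requiredPos);
}
```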

Tomalak commented 3 years ago

No, I meant brute-forcing as a one-off to see how much overlap there is, not as a regular activity during look-up. :) Maybe there is a way to whittle it down to a very limited list of endings that need a second look during regular processing, and hard-code that.

birtles commented 3 years ago

> No, I meant brute-forcing as a one-off to see how much overlap there is, not as a regular activity during look-up. :) Maybe there is a way to whittle it down to a very limited list of endings that need a second look during regular processing, and hard-code that.

Yes, sorry, I was responding to your final comment about improving lookup times in general.

Tomalak commented 3 years ago

Here's another one with たら and らしい. This also shows a compound verb. (source)

image