birchill / 10ten-ja-reader

A browser extension to translate Japanese by hovering over words.
https://addons.mozilla.org/firefox/addon/10ten-ja-reader/
GNU General Public License v3.0

Unable to recognize composite verb with some (but not all) kanji written in kana #302

Open SaltfishAmi opened 4 years ago

SaltfishAmi commented 4 years ago

つれこむ and 連れ込む are each recognized as a single verb, but neither 連れこむ nor つれ込む is.

This behavior seems reasonable, since we don't want 沢さん to be read as 沢山. But for verbs, especially simple composite-verb components like 込む in 連れこむ or 抜ける in 駆けぬける, there is a good chance a Japanese writer would use kana.

On the other hand, 使い切る and 使いきる are both recognized correctly. Maybe both forms have entries in the dictionary?

birtles commented 4 years ago

Yes, that's right. Many verbs have entries for both forms in the dictionary, e.g. 片づける and 片付ける.

It would be pretty tricky to recognize all the variations automatically. Handling just the first case, 連れこむ, might be possible with some pretty involved rules such as:

  1. One of the longest matches is a masu-stem and the next (unmatched) character is hiragana
  2. There is a longer entry in the database with the same masu-stem followed by kanji
  3. When converting the database entry to hiragana, the next character(s) match the unmatched hiragana for one of the generated deinflections
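For illustration only, the three rules above could be sketched roughly as follows. This is not code from the extension: `ENTRIES`, `toy_deinflect`, and `find_mixed_compound` are hypothetical names, the dictionary is a four-entry toy, and the sketch assumes the caller has already found the masu-stem match and passes in exactly the unmatched remainder of the word.

```python
import re

# Toy stand-in for the dictionary: headword -> kana reading (hypothetical data).
ENTRIES = {
    "連れる": "つれる",
    "連れ込む": "つれこむ",
    "駆ける": "かける",
    "駆け抜ける": "かけぬける",
}

HIRAGANA = re.compile(r"[\u3041-\u3096]")
KANJI = re.compile(r"[\u4e00-\u9fff]")


def toy_deinflect(text):
    """Return candidate dictionary forms. Only the plain form and a
    trailing -んだ (assumed to come from a -む verb) are handled here."""
    candidates = [text]
    if text.endswith("んだ"):
        candidates.append(text[:-2] + "む")
    return candidates


def find_mixed_compound(stem, stem_reading, rest):
    """Return a headword matching `stem` + kana-written `rest`, or None."""
    # Rule 1: the character right after the matched masu-stem is hiragana.
    if not rest or not HIRAGANA.match(rest[0]):
        return None
    candidates = toy_deinflect(rest)
    for headword, reading in ENTRIES.items():
        # Rule 2: a longer entry with the same stem followed by kanji.
        tail = headword[len(stem):]
        if not headword.startswith(stem) or not tail or not KANJI.match(tail[0]):
            continue
        # Rule 3: the entry's reading, past the stem, equals one of the
        # deinflected forms of the unmatched hiragana.
        if reading.startswith(stem_reading):
            if reading[len(stem_reading):] in candidates:
                return headword
    return None


print(find_mixed_compound("連れ", "つれ", "こむ"))
print(find_mixed_compound("駆け", "かけ", "ぬける"))
```

As in the rules themselves, this only covers the 連れこむ pattern (kanji stem, kana second component); the reverse case, つれ込む, is not handled.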

It's pretty involved, and whether or not a composite verb ends up in the dictionary is already quite arbitrary, so I'm not sure how important this would be.

nicolasmaia commented 4 years ago

Whenever I stumble upon this problem, I tend to propose the new forms to JMdict instead.

In this case, 連れこむ has a reasonable n-gram count (645), but つれ込む has only 38, so I'm going to propose adding the first one.