arianneorpilla / jidoujisho

A full-featured immersion language learning suite for mobile.
GNU General Public License v3.0

fix: improve immersionkit cloze for compounds #357

Closed m-edlund closed 7 months ago

m-edlund commented 7 months ago

Improved the calculated cloze of the immersionKit enhancement for long terms such as:

いざとなれば currently only finds: いざ
小細工 currently only finds:
禁断症状 currently only finds: 禁断
心境の変化 currently only finds: 心境

It works by finding the longest substring of the search term that occurs in the example sentence and taking that as the cloze. In cases where there is an alternative spelling for the same reading, the original method is used as a fallback. This does not solve cloze detection for cases in which the sentence uses an alternate spelling and the original method also fails to find the entire cloze.

Take for example the search term 度し難い. If the example sentence contains 度しがたい, the cloze will be only 度し, as this is the longest span either of the two methods has found.
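To make the longest-substring idea concrete, here is a minimal sketch in Python (not the Dart code from this PR). The function name `longest_term_substring` and the example sentences are made up for illustration, and the fallback to the original method is left out.

```python
def longest_term_substring(term: str, sentence: str) -> str:
    """Return the longest substring of `term` that also occurs in `sentence`.

    Hypothetical helper for illustration only; returns "" if nothing matches.
    """
    # Try the longest candidates first so the first hit is the longest one.
    for length in range(len(term), 0, -1):
        for start in range(len(term) - length + 1):
            candidate = term[start:start + length]
            if candidate in sentence:
                return candidate
    return ""


# The whole compound is found when the sentence spells it the same way.
full = longest_term_substring("禁断症状", "彼は禁断症状に苦しんでいる")

# With an alternate spelling in the sentence, only the shared prefix is found.
partial = longest_term_substring("度し難い", "まったく度しがたい男だ")
```

Here `full` is the complete term 禁断症状, while `partial` is only 度し, which is the limitation described above.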

m-edlund commented 7 months ago

I've found a couple more ways to improve the cloze finding. By using the wordList, the cloze is extended with new words from the sentence until it is at least as long as the word. This way it should be better at dealing with cases such as 度し難い, even when it is written as 度しがたい. Additionally, it will now also provide a correct cloze for cases where the kanji in the term and sentence are two different variants.
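A rough Python sketch of the extension step, under some assumptions: `word_list` stands in for the tokenized sentence, `start_index` is the token where the cloze begins, and the tokenization shown is invented for the example. None of these names are taken from the actual code.

```python
def extend_cloze(term: str, word_list: list[str], start_index: int) -> str:
    """Append tokens from `word_list` until the cloze is at least as long as `term`.

    Hypothetical helper; `start_index` is assumed to point at the token
    where the cloze starts.
    """
    cloze = ""
    for word in word_list[start_index:]:
        if len(cloze) >= len(term):
            break
        cloze += word
    return cloze


# Assumed tokenization of a made-up sentence containing 度しがたい.
tokens = ["まったく", "度し", "がたい", "男", "だ"]
extended = extend_cloze("度し難い", tokens, 1)
```

With this tokenization `extended` becomes 度しがたい: 度し alone is shorter than the four-character term, so the next token is appended and the loop stops.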

Ichidan verbs are now also considered. In this case the term can be longer than the cloze by one character and the cloze is still valid. For example, for words such as 比べる, immersionkit will also find sentences with 比べ, which should also be clozed correctly. To ensure that this is not done every time, the next word is checked to see if it is a conjugation for either ichidan verbs or godan verbs ending in る (I think I got most of them, but might still have missed one or two).
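As an illustration of the ichidan check only (the godan-ending-in-る part is not covered), a small Python sketch follows. The names `ICHIDAN_FOLLOWERS` and `accepts_ichidan_stem` are hypothetical, and the list of endings is a short, incomplete sample rather than the list added in this PR.

```python
# Illustrative, incomplete sample of words that can follow an ichidan stem.
ICHIDAN_FOLLOWERS = ("ます", "ない", "た", "て", "られる", "よう")


def accepts_ichidan_stem(term: str, cloze: str, next_word: str) -> bool:
    """Accept a cloze one character shorter than `term` if it looks like a
    conjugated ichidan verb: the term ends in る, the cloze is the term
    without that る, and the next word starts with a known ending.
    """
    if not term.endswith("る"):
        return False
    if cloze != term[:-1]:
        return False
    # str.startswith accepts a tuple of candidate prefixes.
    return next_word.startswith(ICHIDAN_FOLLOWERS)


ok = accepts_ichidan_stem("比べる", "比べ", "て")        # 比べ + て
rejected = accepts_ichidan_stem("比べる", "比べ", "もの")  # not a conjugation
```

Checking the next word keeps a bare 比べ followed by an unrelated word from being accepted as a full match for 比べる.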

Godan verbs are also considered, as a conjugated godan verb can end up with immersionkit setting the word boundary between the stem and the conjugated ending. In this case we simply take another word.

I just put the list of conjugations at the start of the file, let me know if there is a better place for them.

I probably still forgot one or two edge-cases, but this should be an improvement for most of the clozes that required manual editing before.

m-edlund commented 7 months ago

Final edit: most godan verb conjugations are now properly covered. Some edge cases obviously still exist (also because immersionkit's tokenization is slightly off in some situations), but in those cases it will just default to the stem of the word. Additionally, I moved the conjugations out to their own file, as they would have cluttered the immersion kit file otherwise. If I should move it, lmk.