LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading.
https://luteorg.github.io/lute-manual/
MIT License

Investigate finding "similar terms" for parent suggestions #74

Open jzohrab opened 11 months ago

jzohrab commented 11 months ago

Currently, Lute asks the user to type in the characters of the parent term to find a match. There may be a way to pre-calculate likely parents for a term.

LWT uses an algorithm ostensibly based on http://www.catalysoft.com/articles/StrikeAMatch.html, but I don't know how accurate LWT's code is. That algorithm has a possibly buggy Python implementation in https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings -- there are libraries out there that have this and other algorithms we could try.
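For reference, a minimal sketch of the letter-pair idea from the Strike a Match article: score two strings as twice the number of shared adjacent-letter pairs divided by the total number of pairs. This is not LWT's code. The names `letter_pairs`, `similarity`, and `suggest_parents` are hypothetical, and the lower-casing and whitespace splitting are assumptions.

```python
import heapq
from collections import Counter


def letter_pairs(text):
    """Count adjacent letter pairs in each whitespace-separated word."""
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def similarity(a, b):
    """Return 2 * shared pairs / total pairs, a value between 0.0 and 1.0."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        # Single-character terms have no pairs, so nothing can be compared.
        return 0.0
    shared = sum((pairs_a & pairs_b).values())  # multiset intersection
    return 2.0 * shared / total


def suggest_parents(term, candidates, n=5):
    """Hypothetical helper: the n candidates most similar to term."""
    scored = ((similarity(term, c), c) for c in candidates if c != term)
    return heapq.nlargest(n, scored)
```

Note that single-character terms produce no pairs and always score 0.0, which is one reason this may not carry over to character-based languages without changes.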

I'm not sure how well this works with things like Japanese (char-based), or with accents -- should accents be "normalized" out of words? Does that even work for languages like Thai or Armenian? I don't know.
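One way accent normalization could be tried out, as a sketch only: decompose the string with the standard library's `unicodedata`, drop characters with a non-zero combining class, and recompose. The function name `strip_marks` is hypothetical, and whether the result is linguistically acceptable is an open question per language.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (accents) and return the text in NFC form."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so that scripts such as Hangul, which NFD splits into
    # jamo, come back as the original precomposed syllables.
    return unicodedata.normalize("NFC", kept)
```

This works for cases like "café" and "cafe", but in some scripts the marks carry tone or vowel information, so stripping them could merge unrelated words. It would probably need to be a per-language setting.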

Then the algorithm just needs a simple speed check -- e.g., if a user (like me) has 100K+ terms in a db, does it respond reasonably quickly with the top matches?
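A rough sketch of what such a speed check could look like, using random synthetic terms and the standard library's `difflib.SequenceMatcher` as a stand-in scorer. All function names here are hypothetical, and real terms from a Lute database would give more meaningful timings than random strings.

```python
import heapq
import random
import string
import time
from difflib import SequenceMatcher


def make_terms(count, seed=0):
    """Generate random lowercase 'terms' as stand-in test data."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
        for _ in range(count)
    ]


def top_matches(query, terms, n=5):
    """Score every term against the query and keep the n best (score, term)."""
    scored = ((SequenceMatcher(None, query, t).ratio(), t) for t in terms)
    return heapq.nlargest(n, scored)


def time_lookup(query, terms, n=5):
    """Return the top matches and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = top_matches(query, terms, n)
    return result, time.perf_counter() - start
```

Calling `time_lookup("hablando", make_terms(100_000))` would give a first number to judge against. A full scan like this is the worst case; if it is too slow, candidates could be narrowed first (for example by shared prefix) before scoring.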

jamesdeluk commented 10 months ago

Speaking for Korean, if the first two characters are the same, there's a reasonable likelihood of a parent/child relationship.
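A minimal sketch of that heuristic: index the known terms by their first two characters and look up candidates that share the prefix. The function names are hypothetical, and it assumes the terms are stored as precomposed Hangul syllables (NFC), so one syllable is one character.

```python
from collections import defaultdict


def build_prefix_index(terms, prefix_len=2):
    """Group terms by their first prefix_len characters."""
    index = defaultdict(list)
    for term in terms:
        if len(term) >= prefix_len:
            index[term[:prefix_len]].append(term)
    return index


def prefix_candidates(term, index, prefix_len=2):
    """Return other terms sharing the same prefix as possible parents."""
    return [t for t in index.get(term[:prefix_len], []) if t != term]


terms = ["공부하다", "공부했다", "공부", "가다"]
index = build_prefix_index(terms)
candidates = prefix_candidates("공부했다", index)
```

Here `candidates` holds the other two terms starting with "공부". The lookup is a single dictionary access, so it should stay fast with a large term list, but it would miss pairs whose shared stem is only one syllable long.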

GrimPixel commented 9 months ago

There is a browser add-on that tries to find similar terms, sets a default value for them, and marks them differently: https://github.com/geajack/Wordology

There could be a plug-in in Lute that tries to find the lemma of a term, too.

jzohrab commented 9 months ago

Wordology looks super, thanks for the link :-)

Yes, something like a mapping of words, either pre-computed or with a lemma lookup, is close to the idea. The CSV import is also a way of specifying a bulk mapping; it's not a terrible way to do it.

GrimPixel commented 9 months ago

Try this: https://github.com/adbar/simplemma
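A sketch of how a lemma lookup with simplemma might feed a parent suggestion. It assumes the package is installed (`pip install simplemma`) and uses its `lemmatize(token, lang=...)` function; the wrapper `suggest_lemma` is hypothetical and not part of Lute or simplemma.

```python
try:
    import simplemma
except ImportError:  # treat the library as an optional dependency
    simplemma = None


def suggest_lemma(term, lang):
    """Return a candidate parent for term, or None if there is no suggestion."""
    if simplemma is None:
        return None
    lemma = simplemma.lemmatize(term, lang=lang)
    # When the lookup yields the term itself, it adds no parent information.
    return lemma if lemma != term else None
```

With simplemma available, `suggest_lemma("masks", "en")` would be expected to suggest "mask". Coverage depends on which languages simplemma supports, so this would only help for a subset of the languages Lute users study.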