Open jzohrab opened 11 months ago
Speaking for Korean, if the first two characters are the same, there's a reasonable likelihood of a parent/child relationship.
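A rough sketch of that prefix heuristic, in case it helps the discussion. `candidate_parents` and `prefix_len` are made-up names for illustration, not anything in Lute, and the sketch assumes terms are stored as precomposed Hangul syllables (NFC), so that one syllable block is one character:

```python
def candidate_parents(term, known_terms, prefix_len=2):
    """Return known terms that share the first `prefix_len` characters with `term`.

    Hypothetical helper: a cheap first pass, not a real lemmatizer.
    """
    prefix = term[:prefix_len]
    if len(prefix) < prefix_len:
        # Term is too short for the heuristic to say anything useful.
        return []
    return [t for t in known_terms if t != term and t.startswith(prefix)]


known = ["공부하다", "공부", "학교", "공부했다"]
print(candidate_parents("공부했다", known))
```

This would only narrow the list for the user to pick from; it can't tell which of the candidates is actually the parent.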
There is a browser add-on that tries to find similar terms, sets a default value for them, and marks them differently: https://github.com/geajack/Wordology
There could be a plug-in in Lute that tries to find the lemma of a term, too.
Wordology looks super, thanks for the link :-)
Yes, something like a mapping of words, either pre-computed or with a lemma lookup, is close to the idea. The CSV import is also a way of specifying a bulk mapping, and it's not a terrible way to do it.
Try this: https://github.com/adbar/simplemma
Currently, Lute asks the user to type in the characters for the parent match. There may be a way to pre-calculate likely parents for a term.
LWT uses an algorithm ostensibly based on http://www.catalysoft.com/articles/StrikeAMatch.html, but I don't know how accurate LWT's code is. That algorithm has a possibly buggy Python implementation in https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings -- there are libraries out there that have this and other algorithms we could try.
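For reference, here is a minimal sketch of the letter-pair idea from the Strike a Match article as I read it: split each string into adjacent character pairs per word, then score `2 * shared_pairs / total_pairs`. This is not LWT's code and not the Stack Overflow version; `letter_pairs` and `similarity` are names chosen for this sketch.

```python
from collections import Counter


def letter_pairs(text):
    """Count adjacent character pairs in each whitespace-separated word."""
    pairs = Counter()
    for word in text.upper().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def similarity(a, b):
    """Score in [0, 1]: twice the shared pairs over the total number of pairs."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        # Neither string has a word of two or more characters.
        return 0.0
    # Counter intersection keeps the minimum count of each pair,
    # so a repeated pair is only matched as often as it occurs in both.
    shared = sum((pairs_a & pairs_b).values())
    return 2.0 * shared / total


print(similarity("France", "French"))  # FR and NC are shared: 2 * 2 / 10 = 0.4
```

Single-character terms produce no pairs and always score 0, which is probably relevant for char-based languages.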
I'm not sure how well this works with things like Japanese (char-based), or with accents -- should accents be "normalized" out of words? Does that even work for languages like Thai or Armenian? I don't know.
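To make the accent question concrete, here is one common way of "normalizing out" accents: decompose with NFD, drop nonspacing marks (Unicode category `Mn`), and recompose with NFC. `strip_marks` is a hypothetical helper, and whether this is appropriate at all is language-dependent, which is the open question.

```python
import unicodedata


def strip_marks(text):
    """Remove nonspacing combining marks, then recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose so that scripts like Hangul, which NFD splits into jamo,
    # come back as the original syllable blocks.
    return unicodedata.normalize("NFC", kept)


print(strip_marks("café"))   # cafe
print(strip_marks("が"))     # か : the voicing mark is a combining mark too
print(strip_marks("น้ำ") == "น้ำ")  # False: the Thai tone mark is removed
```

So this looks reasonable for Latin-script accents, but for Japanese and Thai it changes the spelling of the word itself, and it would probably need to be an opt-in, per-language setting.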
Then the algorithm just needs a simple speed check -- e.g., if a user (like me) has 100K+ terms in a db, does it return the highest-scoring matches reasonably quickly?
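A quick way to run that speed check without a real database might look like the sketch below. Everything in it is an assumption for illustration: the terms are random ASCII strings rather than real data, the score is a simplified set-based bigram overlap rather than LWT's algorithm, and the function names are made up.

```python
import heapq
import random
import string
import time


def bigrams(term):
    """Set of adjacent character pairs in a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(a, b):
    """Overlap score between two bigram sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def build_index(terms):
    """Precompute bigrams once so each lookup only does set intersections."""
    return [(term, bigrams(term)) for term in terms]


def top_matches(query, index, n=5):
    """Return the n best (score, term) pairs for the query."""
    q = bigrams(query)
    scored = ((dice(q, grams), term) for term, grams in index)
    return heapq.nlargest(n, scored)


def random_terms(count, seed=0):
    """Synthetic stand-in for a user's term list."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
        for _ in range(count)
    ]


def time_lookup(term_count=100_000, query="example"):
    """Time one lookup against a synthetic index; returns (seconds, matches)."""
    index = build_index(random_terms(term_count))
    start = time.perf_counter()
    matches = top_matches(query, index)
    return time.perf_counter() - start, matches


if __name__ == "__main__":
    elapsed, matches = time_lookup()
    print(f"{elapsed:.3f}s", matches)
```

This is a brute-force scan over every term, so it gives a worst-case baseline; if that is already fast enough at 100K terms, there's no need for anything cleverer.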