Open ruohoruotsi opened 5 years ago
@Olamyy nicely points out that
a word2vec/sentence2vec model might work really well here. For every entry (word or sentence) a user inputs, try to find it in the model vocabulary. If it doesn't exist, either raise an error or fall back to the closest word in the vocab. I suppose fasttext would work well here since it uses subword (n-gram) sets. The challenge here might just be the extra step.
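
A rough sketch of that "look up, else error or closest word" flow, under loud assumptions: `lookup`, the toy `vocab` list, and the `cutoff` value are all hypothetical, and plain string similarity from the standard library's `difflib` stands in for the embedding or subword similarity a word2vec/fasttext model would actually provide.

```python
import difflib


def lookup(word, vocab, cutoff=0.8):
    """Return (match, is_exact); raise KeyError if nothing in vocab is close enough.

    NOTE: difflib string similarity is a stand-in here, not fasttext's
    subword-based similarity.
    """
    if word in vocab:
        return word, True
    close = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    if not close:
        raise KeyError(f"{word!r} is out of vocabulary")
    return close[0], False


# Hypothetical toy vocabulary of undiacritized word forms.
vocab = ["alantakun", "omo", "ile"]

exact = lookup("omo", vocab)        # found as-is
near = lookup("alantakum", vocab)   # one-letter typo, falls back to the closest entry
```

Whether to raise or to substitute the closest word is the design choice mentioned above; the `cutoff` controls how aggressive the fallback is.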
Pre-filter words whose non-diacritized word-forms are not in the dictionary, before asking the model to do ADR. This way we can get more predictable results and error messages for Out-Of-Vocabulary (OOV) words.
If the model sees a word like `elerindodo`, validate that this word's diacritic form exists in the dictionary and return an error message if it doesn't! This way, since the model doesn't know about `elerindodo`, it can just say so, rather than confuse the users by returning the "top probability word" which may be a random thing like `aláǹtakùn`!
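
A minimal sketch of the pre-filter described above, not the project's actual implementation. It assumes the dictionary is available as a plain list of diacritized words and that stripping Unicode combining marks is an acceptable way to get the non-diacritized form; `strip_diacritics`, `build_index`, `find_oov`, and the three-word dictionary are hypothetical names and data chosen for illustration.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks (tone marks, under-dots) and lowercase the result."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare).lower()


def build_index(dictionary_words):
    """Map each non-diacritized form to the set of dictionary entries sharing it."""
    index = {}
    for entry in dictionary_words:
        index.setdefault(strip_diacritics(entry), set()).add(entry)
    return index


def find_oov(text: str, index) -> list:
    """Return whitespace-separated tokens whose bare form is missing from the index."""
    return [tok for tok in text.split() if strip_diacritics(tok) not in index]


# Hypothetical toy dictionary; a real one would be loaded from the project's word list.
index = build_index(["aláǹtakùn", "ọmọ", "ilé"])

# Run the check before the text ever reaches the ADR model.
oov = find_oov("omo elerindodo", index)
if oov:
    message = "Not in dictionary: " + ", ".join(oov)
else:
    message = "ok"
print(message)
```

With this in front of the model, an input containing `elerindodo` gets an explicit OOV message, and only text whose words all pass the check is handed on for diacritic restoration.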