Niger-Volta-LTI / iranlowo

Ìrànlọ́wọ́ is a utility library for analysis & (pre)processing of Yorùbá text → https://pypi.org/project/iranlowo
MIT License
17 stars, 8 forks

Pre-filter words whose diacritic forms are not in the dictionary #15

Open ruohoruotsi opened 5 years ago

ruohoruotsi commented 5 years ago

Pre-filter words whose non-diacritized word-forms are not in the dictionary, before asking the model to do automatic diacritic restoration (ADR). This way we get more predictable results and clearer error messages for out-of-vocabulary (OOV) words.

If the model sees a word like elerindodo, check that this word's diacritic form exists in the dictionary and return an error message if it doesn't! Since the model doesn't know about elerindodo, it can simply say so, rather than confusing users by returning the "top probability word", which may be something random like aláǹtakùn!
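A minimal sketch of what such a pre-filter could look like, assuming the dictionary is available as a plain list of diacritized words. The names `strip_diacritics`, `build_undiacritized_index`, `find_oov` and the toy `VOCAB` are hypothetical and not part of the iranlowo API; the only library calls used are from Python's standard `unicodedata` module.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Lowercase a word and drop all combining marks (tones, under-dots)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_undiacritized_index(vocab):
    """Map each undiacritized form to the set of diacritized words it covers."""
    index = {}
    for word in vocab:
        index.setdefault(strip_diacritics(word), set()).add(word)
    return index


def find_oov(text: str, index) -> list:
    """Return whitespace-separated tokens whose undiacritized form is unknown."""
    return [tok for tok in text.split() if strip_diacritics(tok) not in index]


# Hypothetical toy dictionary; a real one would be loaded from the model's vocabulary.
VOCAB = ["aláǹtakùn", "ọmọ", "ilé"]
INDEX = build_undiacritized_index(VOCAB)

oov = find_oov("omo elerindodo ile", INDEX)
if oov:
    print(f"Out-of-vocabulary words, skipping ADR for these: {oov}")
```

With a check like this, the ADR model would only be called when `find_oov` returns an empty list, and the caller gets the list of unknown words to report otherwise.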

ruohoruotsi commented 5 years ago

@Olamyy nicely points out that:

a word2vec/sentence2vec model might work really well here. For every entry (word/sentence) a user inputs, try to find the word in the model vocabulary. If it doesn't exist, either raise an error or get the closest word in the vocab. I suppose fastText would work well here since it uses subword (n-gram) sets. The challenge here might just be the extra step.
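A rough sketch of the "raise an error or get the closest word" idea. It does not use a trained word2vec or fastText model; as a stand-in it compares fastText-style character n-gram sets with Jaccard overlap, using only the standard library. The function names, the toy vocabulary, and the `min_similarity` threshold are all assumptions made for illustration.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> set:
    """Character n-grams of a word padded with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def closest_in_vocab(word: str, vocab, min_similarity: float = 0.3) -> str:
    """Return word if it is in vocab, else the most n-gram-similar entry.

    Raises KeyError when no entry reaches min_similarity (an assumed cutoff).
    """
    if word in vocab:
        return word
    target = char_ngrams(word)
    best, best_score = None, 0.0
    for candidate in vocab:
        grams = char_ngrams(candidate)
        union = target | grams
        score = len(target & grams) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = candidate, score
    if best is None or best_score < min_similarity:
        raise KeyError(f"'{word}' is out of vocabulary and has no close match")
    return best


# Hypothetical undiacritized vocabulary.
vocab = ["alantakun", "omo", "ile"]
print(closest_in_vocab("alantakum", vocab))  # near-miss spelling of a known word
```

Here a near-miss such as alantakum resolves to a known entry, while elerindodo shares no n-grams with this toy vocabulary and raises `KeyError`, which matches the "just say so" behaviour requested above. The extra step mentioned in the comment is the vocabulary scan, which is linear in vocabulary size in this sketch.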