Open askarbozcan opened 3 years ago
Note to self: Modify the symspellpy distance calculation so that changing Turkish umlaut characters to their English counterparts (ü -> u, ç -> c) and vice versa (u -> ü, c -> ç) has a smaller edit distance than changing any other character.
EDIT: After a thorough reading of SymSpellPy's source code it is pretty much impossible to overload symspellpy's distance without rewriting the whole distance calculation itself with Turkish character equivalency in mind.
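For reference, the idea itself can be sketched outside of symspellpy. The snippet below is a standalone weighted Levenshtein distance, not symspellpy's own (Damerau-based) implementation and not something that plugs into its index. The character pairs and the 0.25 substitution cost are assumptions chosen for illustration, and only lowercase letters are handled.

```python
# Standalone sketch: Levenshtein distance with cheaper Turkish diacritic swaps.
# The pair list and DIACRITIC_COST are illustrative assumptions, not symspellpy API.
_PAIRS = {("u", "ü"), ("o", "ö"), ("c", "ç"), ("s", "ş"), ("g", "ğ"), ("i", "ı")}
EQUIVALENT = _PAIRS | {(b, a) for a, b in _PAIRS}
DIACRITIC_COST = 0.25  # assumed; any value below 1.0 favours diacritic fixes


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if (a, b) in EQUIVALENT:
        return DIACRITIC_COST
    return 1.0


def weighted_distance(source: str, target: str) -> float:
    """Row-by-row dynamic programming over the edit-distance table."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,                                   # deletion
                current[j - 1] + 1.0,                                # insertion
                previous[j - 1] + substitution_cost(s_char, t_char)  # substitution
            ))
        previous = current
    return previous[-1]
```

With these assumed costs, "yüzülmüyor" is closer to "yuzulmuyor" (three diacritic swaps) than "duyulmuyor" is (two ordinary substitutions).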
An approach of simply generating all possible combinations of Turkish umlauts in a word and picking the correction with the smallest edit distance among them (thus simulating Turkish character equivalency) yielded around 58% accuracy. However, because of the number of possible combinations it was far too slow, so it was scrapped.
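To illustrate why this blows up, here is a minimal sketch of the variant generation step. It is not the scrapped code itself; the set of flippable characters is an assumption, and each variant would still need its own lookup afterwards.

```python
from itertools import product

# Assumed set of flippable characters (lowercase only), in both directions.
_ASCII = "uocsgi"
_TURKISH = "üöçşğı"
FLIP = {**dict(zip(_ASCII, _TURKISH)), **dict(zip(_TURKISH, _ASCII))}


def diacritic_variants(word: str) -> list[str]:
    """Return every combination of flipped / unflipped characters in word."""
    # Each flippable character contributes two options, every other one a single option.
    options = [(ch, FLIP[ch]) if ch in FLIP else (ch,) for ch in word]
    return ["".join(combo) for combo in product(*options)]
```

A word with k flippable characters produces 2^k variants, so "yuzulmuyor" (four such characters) already needs 16 lookups, and longer words grow exponentially.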
For now, the only way umlauts are compensated for is by also looking up the word's "flipped" version (e.g. when "yuzuyorum" is looked up, "yüzüyörüm" is also looked up).
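A minimal sketch of what such a flip could look like, assuming every flippable character is swapped at once. The function names and the character set are hypothetical and not taken from the module.

```python
# Assumed character pairs (lowercase only); each one is swapped with its counterpart.
_ASCII = "uocsgi"
_TURKISH = "üöçşğı"
_FLIP_TABLE = str.maketrans(_ASCII + _TURKISH, _TURKISH + _ASCII)


def flip_diacritics(word: str) -> str:
    """Swap every flippable character with its counterpart in a single pass."""
    return word.translate(_FLIP_TABLE)


def lookup_terms(word: str) -> list[str]:
    """Terms to send to the corrector: the word itself plus its flipped form."""
    flipped = flip_diacritics(word)
    return [word] if flipped == word else [word, flipped]
```

This costs only one extra lookup per word, but it covers just the all-or-nothing case and not words where only some characters are wrong.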
As an extra note, see this: https://towardsdatascience.com/spelling-correction-how-to-make-an-accurate-and-fast-corrector-dc6d0bcbba5f
As the title says, this PR adds a spelling correction module that uses SymSpellPy as its backend. For the spelling correction vocabulary, a term frequency list of approximately 450k entries was created by merging OpenSubtitles (Turkish) and Turkish Wikipedia data.
Currently it uses only one of the possible ways to use SymSpellPy, namely its lookup_compound() method, which is not necessarily the best way to correct spelling.
The module is integrated as such:
TODO:
- Keep term frequency vocabulary in a bucket instead of LFS (no need)
- Customizable spelling correction (configs/overriding spelling correction class?) (can be added in another PR)
- (Currently basic mode tests fail)
- Dataset to test on: https://github.com/StarlangSoftware/Dictionary/blob/master/src/main/resources/turkish_misspellings.txt
EDIT: Best result: max_edit_distance = 2, accuracy 51%. Most of the mistakes were in words with wrongly placed (or omitted) Turkish umlaut letters, e.g. "yuzulmuyor" was corrected to "duyulmuyor" when it should have been "yüzülmüyor".
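An accuracy figure like this can be computed with a small loop over (misspelled, expected) pairs. The sketch below is only an assumption about how such an evaluation might be written: `correct` stands for whatever corrector is being tested, and parsing the dataset file into pairs is left out because its format is not described here.

```python
from typing import Callable, Iterable, Tuple


def accuracy(pairs: Iterable[Tuple[str, str]], correct: Callable[[str], str]) -> float:
    """Fraction of misspellings for which the corrector returns the expected word."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for misspelled, expected in pairs if correct(misspelled) == expected)
    return hits / len(pairs)
```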
Two (orthogonal to each other) ways to bring accuracy to 90%+: 1) Prioritize fixing wrongly placed Turkish characters first. 2) Use FastText embeddings to pick the best candidate based on semantic meaning of the word and its neighbours.
These improvements are left to other PRs as this PR is already getting a bit too large.
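As a rough idea of what option 1 could look like in a follow-up, the sketch below re-ranks the candidates returned by a corrector: a candidate that differs from the input only in Turkish diacritics is preferred over the corrector's own first choice. The function names and the folding table are hypothetical, and it assumes the candidate list is ordered best-first.

```python
# Assumed folding of Turkish letters to their ASCII counterparts (lowercase only).
_FOLD_TABLE = str.maketrans("üöçşğı", "uocsgi")


def ascii_fold(word: str) -> str:
    """Replace Turkish-specific letters with their ASCII counterparts."""
    return word.translate(_FOLD_TABLE)


def pick_candidate(word: str, candidates: list[str]) -> str:
    """Prefer a candidate that matches word once diacritics are ignored."""
    folded = ascii_fold(word)
    for candidate in candidates:
        if ascii_fold(candidate) == folded:
            return candidate
    # No diacritic-only match: fall back to the corrector's top choice, if any.
    return candidates[0] if candidates else word
```

For the example above, `pick_candidate("yuzulmuyor", ["duyulmuyor", "yüzülmüyor"])` picks "yüzülmüyor", provided that word is among the candidates at all.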
resolves #190