GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License
92 stars 15 forks

Add spelling correction module [resolves #190] #213

Open askarbozcan opened 3 years ago

askarbozcan commented 3 years ago

As the title says, this PR adds a spelling correction module that uses SymSpellPy as the backend. The vocabulary for spelling correction is a term-frequency list of approximately 450k entries, created by merging OpenSubtitles (Turkish) and Turkish Wikipedia data.

It currently uses only one of the possible ways to call SymSpellPy, namely its lookup_compound() method, which is not necessarily the best way to correct spelling.

The module is integrated as follows:

from sadedegel import Doc

d = Doc("Ali bubanın çiftliği")
d_fixed = d.get_spell_corrected()  # method added by this PR
print(d_fixed)  # "Ali babanın çiftliği"

TODO:

Dataset to test on: https://github.com/StarlangSoftware/Dictionary/blob/master/src/main/resources/turkish_misspellings.txt

EDIT: Best result: max_edit_distance = 2, accuracy 51%. Most of the mistakes were in words with wrongly placed (or omitted) Turkish umlaut letters, e.g. "yuzulmuyor" was corrected to "duyulmuyor" when it should have been "yüzülmüyor".
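For reference, the accuracy figure above is the share of misspellings that get corrected to the expected word. A minimal sketch of that measurement follows. It assumes the test data has already been loaded as (misspelled, expected) pairs; the `accuracy` helper and the toy corrector are hypothetical stand-ins, not code from this PR.

```python
from typing import Callable, Iterable, Tuple


def accuracy(pairs: Iterable[Tuple[str, str]], correct: Callable[[str], str]) -> float:
    """Fraction of misspellings that the corrector maps to the expected word."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for wrong, expected in pairs if correct(wrong) == expected)
    return hits / len(pairs)


# Toy stand-in for the real corrector (hypothetical): a fixed lookup table
# that reproduces the two corrections mentioned in this thread.
TOY_TABLE = {"yuzulmuyor": "duyulmuyor", "bubanın": "babanın"}


def toy_corrector(word: str) -> str:
    return TOY_TABLE.get(word, word)


pairs = [("yuzulmuyor", "yüzülmüyor"), ("bubanın", "babanın")]
print(accuracy(pairs, toy_corrector))  # 0.5
```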

Two ways (orthogonal to each other) to bring accuracy to 90%+: 1) Prioritize fixing wrongly placed Turkish characters first. 2) Use FastText embeddings to pick the best candidate based on the semantic meaning of the word and its neighbours.

These improvements are left to other PRs as this PR is already getting a bit too large.

resolves #190

askarbozcan commented 3 years ago

Note to self: modify the symspellpy distance calculation so that changing Turkish umlaut characters to their English counterparts (ü -> u, ç -> c) and vice versa (u -> ü, c -> ç) has a smaller edit distance than changing any other character.

EDIT: After a thorough reading of SymSpellPy's source code, it is pretty much impossible to override its distance function without rewriting the whole distance calculation with Turkish character equivalency in mind.
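To illustrate the idea from the note, here is a rough standalone sketch of a distance where swapping a Turkish letter with its ASCII look-alike is cheaper than any other edit. It is plain Levenshtein written from scratch, not a SymSpellPy hook. The letter pairs, the 0.25 cost and the function names are assumptions made for illustration.

```python
# Turkish letters paired with their ASCII look-alikes (lowercase only; assumed set).
TURKISH_PAIRS = {("c", "ç"), ("g", "ğ"), ("i", "ı"), ("o", "ö"), ("s", "ş"), ("u", "ü")}
EQUIVALENT = TURKISH_PAIRS | {(b, a) for a, b in TURKISH_PAIRS}


def substitution_cost(a: str, b: str, diacritic_cost: float = 0.25) -> float:
    """0 for identical letters, a reduced cost for look-alike pairs, 1 otherwise."""
    if a == b:
        return 0.0
    if (a, b) in EQUIVALENT:
        return diacritic_cost
    return 1.0


def weighted_distance(source: str, target: str, diacritic_cost: float = 0.25) -> float:
    """Levenshtein distance with cheaper substitutions for Turkish look-alikes."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,     # delete s_char
                current[j - 1] + 1.0,  # insert t_char
                previous[j - 1] + substitution_cost(s_char, t_char, diacritic_cost),
            ))
        previous = current
    return previous[-1]


print(weighted_distance("yuzulmuyor", "yüzülmüyor"))  # 0.75
print(weighted_distance("yuzulmuyor", "duyulmuyor"))  # 2.0
```

With these weights the intended "yüzülmüyor" ranks ahead of "duyulmuyor". With unweighted distances it is the other way round (3 edits versus 2), which matches the miscorrection reported in the PR description.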

An approach that simply generates all possible combinations of Turkish umlauts in a word and picks the correction with the smallest edit distance among them (thus simulating Turkish character equivalency) yielded around 58% accuracy. However, because of the number of possible combinations, it was far too slow, so it was scrapped.
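A minimal sketch of the combination step described above, to show why it gets slow: every letter that has a Turkish or ASCII counterpart doubles the number of variants. The mapping and the `diacritic_variants` name are assumptions for illustration. The scrapped implementation may have differed.

```python
from itertools import product
from typing import List

# Lowercase letters and their counterparts, in both directions (assumed set).
FLIP = {"c": "ç", "g": "ğ", "i": "ı", "o": "ö", "s": "ş", "u": "ü"}
FLIP.update({turkish: ascii_ for ascii_, turkish in list(FLIP.items())})


def diacritic_variants(word: str) -> List[str]:
    """Return every spelling obtained by optionally flipping each flippable letter."""
    options = [(ch, FLIP[ch]) if ch in FLIP else (ch,) for ch in word]
    return ["".join(chars) for chars in product(*options)]


variants = diacritic_variants("yuzulmuyor")
print(len(variants))             # 16, i.e. 2 ** 4 for the four flippable letters
print("yüzülmüyor" in variants)  # True
```

Each variant would then need its own dictionary lookup, so a word with k flippable letters costs 2 ** k lookups.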

For now, the only way umlauts are compensated for is by also looking up the word's "flipped" version (e.g. when "yuzuyorum" is looked up, "yüzüyörüm" is also looked up).
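The "flipped" lookup can be sketched roughly as below. This assumes flipping means swapping every flippable lowercase letter at once. The helper names are hypothetical and the PR's actual code may be organised differently.

```python
from typing import List

_ASCII = "cgiosu"
_TURKISH = "çğıöşü"
# One translation table covering both directions (u -> ü and ü -> u, etc.).
FLIP_TABLE = str.maketrans(_ASCII + _TURKISH, _TURKISH + _ASCII)


def flip_diacritics(word: str) -> str:
    """Swap every flippable letter with its counterpart in a single pass."""
    return word.translate(FLIP_TABLE)


def candidate_lookups(word: str) -> List[str]:
    """The word itself plus its flipped form, when the two differ."""
    flipped = flip_diacritics(word)
    return [word] if flipped == word else [word, flipped]


print(flip_diacritics("yuzuyorum"))    # yüzüyörüm
print(candidate_lookups("yuzuyorum"))  # ['yuzuyorum', 'yüzüyörüm']
```

This costs only one extra lookup per word. The trade-off is that it only helps when every flippable letter in the word is wrong in the same direction.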

askarbozcan commented 3 years ago

As an extra note, see this: https://towardsdatascience.com/spelling-correction-how-to-make-an-accurate-and-fast-corrector-dc6d0bcbba5f