Add spelling correction module [resolves #190]

askarbozcan commented 3 years ago

As the title says, this PR adds a spelling correction module that uses SymSpellPy as its backend. The vocabulary for spelling correction is a term-frequency list of approximately 450k entries, built by merging OpenSubtitles (Turkish) and Turkish Wikipedia data.

~~Currently utilizes only one of the possible ways to use SymSpellPy, namely its lookup_compound() method, which is not necessarily the best way to correct spelling.~~

The module is integrated as follows:

```python
from sadedegel import Doc  # import assumed from sadedegel's public API

d = Doc("Ali bubanın çiftliği")
d_fixed = d.get_spell_corrected()
print(d_fixed)  # "Ali babanın çiftliği"
```

TODO:

[x] Add more aggressive/softer methods of spelling correction
[x] As a default, load pickled term frequency vocabulary instead of from text (faster loading this way)
[x] Notify user when the dictionary is being loaded (as it takes a few seconds)
~~Keep term frequency vocabulary in a bucket instead of LFS~~ (no need)
~~Customizable spelling correction (configs/overriding spelling correction class?)~~ (can be added in another PR)
[x] Test its performance on dataset shown below and find decent default parameters
[x] Unit tests
[x] Fix punctuation preservation in "basic" mode when multiple punctuation marks are involved. ~~(Currently basic mode tests fail)~~
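
The pickled-vocabulary and load-notification items above could look roughly like the sketch below. It is an illustration only, not the code in this PR: the `load_vocabulary` helper, the file paths, and the one "term count" pair per line text format are all assumptions.

```python
import pickle
from pathlib import Path


def load_vocabulary(txt_path, pkl_path):
    """Return a {term: frequency} dict, preferring the pickled cache.

    Hypothetical helper: assumes one "term count" pair per line in txt_path.
    """
    txt_path, pkl_path = Path(txt_path), Path(pkl_path)
    if pkl_path.exists():
        # Fast path: deserialize the ready-made dictionary.
        with pkl_path.open("rb") as f:
            return pickle.load(f)

    # Slow path: tell the user, parse the text file, then cache the result.
    print("Loading spelling dictionary, this may take a few seconds...")
    vocab = {}
    with txt_path.open(encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            term, count = parts
            vocab[term] = int(count)
    with pkl_path.open("wb") as f:
        pickle.dump(vocab, f)
    return vocab
```

The first call pays the parsing cost and writes the cache; later calls only unpickle.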

Dataset to test on: https://github.com/StarlangSoftware/Dictionary/blob/master/src/main/resources/turkish_misspellings.txt

EDIT: Best result: max_edit_distance = 2, accuracy: 51%.

Most of the mistakes were in words with wrongly placed (or omitted) Turkish umlaut-letters. For example, "yuzulmuyor" was corrected to "duyulmuyor" when it should have been "yüzülmüyor".
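
For reference, accuracy here is taken to mean the share of misspellings whose correction exactly matches the expected word. A minimal sketch of that measurement follows; the `accuracy` function and the toy corrector are hypothetical stand-ins, and the pairs are hard-coded rather than read from the dataset.

```python
def accuracy(pairs, correct):
    """Fraction of (misspelled, expected) pairs that `correct` fixes exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for wrong, expected in pairs if correct(wrong) == expected)
    return hits / len(pairs)


# Toy stand-in for the real corrector: a fixed lookup table (illustration only).
TOY_TABLE = {"yuzulmuyor": "duyulmuyor", "bubanın": "babanın"}


def toy_correct(word):
    return TOY_TABLE.get(word, word)


pairs = [("yuzulmuyor", "yüzülmüyor"), ("bubanın", "babanın")]
print(accuracy(pairs, toy_correct))  # 0.5
```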

Two ways (orthogonal to each other) to bring accuracy to 90%+:

1) Prioritize fixing wrongly placed Turkish characters first.
2) Use FastText embeddings to pick the best candidate based on the semantic meaning of the word and its neighbours.
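
The first idea might be sketched as a re-ranking step on top of whatever candidates the corrector returns. This is only an illustration of the idea, not something implemented in this PR; `fold` and `pick_candidate` are hypothetical names, and only lowercase letters are handled.

```python
# Map Turkish-specific lowercase letters to their ASCII look-alikes.
ASCII_FOLD = str.maketrans("çğıöşü", "cgiosu")


def fold(word):
    """Strip Turkish diacritics so that e.g. 'yüzülmüyor' -> 'yuzulmuyor'."""
    return word.translate(ASCII_FOLD)


def pick_candidate(word, candidates):
    """Prefer a candidate that equals `word` once diacritics are ignored.

    `candidates` is assumed to be ordered best-first by the underlying corrector.
    """
    for cand in candidates:
        if fold(cand) == fold(word):
            return cand
    # No diacritic-only match: fall back to the corrector's top choice.
    return candidates[0] if candidates else word


print(pick_candidate("yuzulmuyor", ["duyulmuyor", "yüzülmüyor"]))  # yüzülmüyor
```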

These improvements are left to other PRs as this PR is already getting a bit too large.

resolves #190

askarbozcan commented 3 years ago

Note to self: Modify the symspellpy distance calculation so that changing Turkish umlaut-characters to their English counterparts (ü -> u, ç -> c) and vice versa (u -> ü, c -> ç) has a smaller edit distance than changing any other character.
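
To make the idea concrete, here is a standalone weighted Levenshtein distance in plain Python. It does not hook into symspellpy in any way; the character pairs and the 0.25 cost are arbitrary assumptions chosen for illustration.

```python
# Pairs treated as "near-equal" (lowercase only); the cost is an arbitrary assumption.
NEAR_PAIRS = {("c", "ç"), ("g", "ğ"), ("i", "ı"), ("o", "ö"), ("s", "ş"), ("u", "ü")}
DIACRITIC_COST = 0.25


def sub_cost(a, b):
    """Cost of substituting character a with character b."""
    if a == b:
        return 0.0
    if (a, b) in NEAR_PAIRS or (b, a) in NEAR_PAIRS:
        return DIACRITIC_COST
    return 1.0


def weighted_distance(source, target):
    """Levenshtein distance with cheap Turkish diacritic substitutions."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        curr = [float(i)]
        for j, b in enumerate(target, start=1):
            delete = prev[j] + 1.0
            insert = curr[j - 1] + 1.0
            substitute = prev[j - 1] + sub_cost(a, b)
            curr.append(min(delete, insert, substitute))
        prev = curr
    return prev[-1]


print(weighted_distance("yuzulmuyor", "yüzülmüyor"))  # 0.75
print(weighted_distance("yuzulmuyor", "duyulmuyor"))  # 2.0
```

With these weights the intended correction "yüzülmüyor" ranks ahead of "duyulmuyor", whereas under a plain edit distance it is the farther of the two (3 edits versus 2).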

EDIT: After a thorough reading of SymSpellPy's source code, it is pretty much impossible to override symspellpy's distance without rewriting the whole distance calculation with Turkish character equivalency in mind.

An approach of simply generating all possible combinations of Turkish umlauts in a word and picking the correction with the smallest edit distance among them (thus simulating Turkish character equivalency) yielded around 58% accuracy. However, because of the number of possible combinations, it was far too slow and was scrapped.
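
A rough sketch of that enumeration, to show where the cost comes from. This is a reconstruction of the idea rather than the scrapped code; the `VARIANTS` table and `diacritic_variants` are hypothetical, and only lowercase letters are covered.

```python
from itertools import product

# Each ambiguous letter maps to the spellings it may stand for (lowercase only).
VARIANTS = {
    "c": "cç", "ç": "cç",
    "g": "gğ", "ğ": "gğ",
    "i": "iı", "ı": "iı",
    "o": "oö", "ö": "oö",
    "s": "sş", "ş": "sş",
    "u": "uü", "ü": "uü",
}


def diacritic_variants(word):
    """Yield every spelling obtained by toggling Turkish diacritics in `word`."""
    choices = [VARIANTS.get(ch, ch) for ch in word]
    for combo in product(*choices):
        yield "".join(combo)


variants = list(diacritic_variants("yuzulmuyor"))
print(len(variants))             # 16
print("yüzülmüyor" in variants)  # True
```

A word with k ambiguous letters produces 2^k variants, and each one needs its own lookup, which is consistent with the slowdown described above.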

For now, the only way umlauts are compensated for is by also looking up the word's "flipped" version (i.e. when "yuzuyorum" is looked up, "yüzüyörüm" is also looked up).
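
A minimal sketch of the flipping step, assuming every affected letter is swapped with its counterpart in both directions. The exact set of letters flipped in this PR may differ (including i/ı here is an assumption), and `flipped` and `lookup_candidates` are hypothetical names.

```python
# Swap each letter with its counterpart in both directions (lowercase only).
FLIP = str.maketrans("cgiosuçğıöşü", "çğıöşücgiosu")


def flipped(word):
    """Return `word` with every Turkish/ASCII look-alike letter swapped."""
    return word.translate(FLIP)


def lookup_candidates(word):
    """Return the spellings to look up: the word itself plus its flipped form."""
    other = flipped(word)
    return [word] if other == word else [word, other]


print(lookup_candidates("yuzuyorum"))  # ['yuzuyorum', 'yüzüyörüm']
```

This costs one extra lookup per word, in contrast to enumerating every diacritic combination.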

askarbozcan commented 3 years ago

As an extra note, see this: https://towardsdatascience.com/spelling-correction-how-to-make-an-accurate-and-fast-corrector-dc6d0bcbba5f

GlobalMaksimum / sadedegel

Add spelling correction module [resolves #190] #213