Spelling Correction Test Results

I continued testing and analyzing the Spelling Correction module as explained in pull request #213. There was no dataset for testing this module's performance, so 100 tweets with spelling mistakes were collected and hand-corrected. To measure the similarity of two texts, a Levenshtein score function was implemented. Right now the Spelling Correction module is evaluated as follows:

d = Doc(text)
correct_d = corrector.correct_doc(d)
levenshtein_dist(d, correct_d)
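
For reference, here is a minimal sketch of a Levenshtein distance on plain strings. It is not the `levenshtein_dist` function from this pull request: the name `levenshtein` and the choice to work on `str` rather than `Doc` objects are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kilgi verir misin", "bilgi verir misin"))  # 1
```

A score of 0 means the two texts are identical, and each additional point corresponds to one more character-level edit.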

An example of this is shown below:

"kilgi verir misin" --> "bilgi verir misin" with Levenshtein score = 1

Initial error analysis is done using the mean of this metric (Mean Levenshtein Score). The results of the initial analysis are:

Tweet Type	Levenshtein Score
Original - Hand Corrected	6.82
Original - Code Corrected	3.23

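This shows that the module changes the tweets less than the hand correction does: code-corrected tweets are on average 3.23 edits away from the original, compared with 6.82 for hand-corrected tweets. A second comparison looks at this from a different point of view by measuring the distance between the code-corrected and hand-corrected tweets directly. That distance (7.98) is larger than the distance between the original and hand-corrected tweets (6.82), so the code-corrected tweets are not getting closer to the hand-corrected ones but rather diverging from them.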

Tweet Type	Levenshtein Score
Original - Hand Corrected	6.82
Code Corrected - Hand Corrected	7.98
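
The three mean scores in the tables above could be computed along the following lines. This is a sketch under assumptions: `edit_distance` and `mean_scores` are hypothetical helper names, the inputs are taken to be three parallel lists of plain strings, and the two toy tweets are invented for illustration and are not part of the 100-tweet dataset.

```python
from statistics import mean


def edit_distance(a, b):
    # Compact row-by-row Levenshtein distance (illustrative helper).
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new = [i]
        for j, cb in enumerate(b, 1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new
    return row[-1]


def mean_scores(original, code_corrected, hand_corrected):
    """Mean Levenshtein score for each of the three compared pairs."""
    pairs = {
        "Original - Hand Corrected": zip(original, hand_corrected),
        "Original - Code Corrected": zip(original, code_corrected),
        "Code Corrected - Hand Corrected": zip(code_corrected, hand_corrected),
    }
    return {
        name: mean(edit_distance(a, b) for a, b in texts)
        for name, texts in pairs.items()
    }


# Invented toy data: the corrector fixes the first tweet and misses the second.
original = ["kilgi verir misin", "gok yuzu"]
code_corrected = ["bilgi verir misin", "gok yuzu"]
hand_corrected = ["bilgi verir misin", "gokyuzu"]

for name, score in mean_scores(original, code_corrected, hand_corrected).items():
    print(name, score)
```

If the corrector were moving tweets toward the hand-corrected text, the Code Corrected - Hand Corrected mean would be expected to fall below the Original - Hand Corrected mean.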

Other results of this initial analysis are:

best_max_edit_distance = 0 worst_max_edit_distance = 24

MSE: 42.23 R2 score: 0.0093
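
The description above does not state which two series the MSE and R2 score are computed between. As a reminder of what the two metrics measure, here is a plain-Python sketch that assumes two parallel sequences of per-tweet scores, `y_true` (reference) and `y_pred` (produced by the code). Both names and the toy numbers are assumptions, not values from this analysis.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Undefined when y_true is constant (SS_tot == 0)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented toy scores, for illustration only.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 6]
print(mse(y_true, y_pred))                 # 1.0
print(round(r2_score(y_true, y_pred), 4))  # 0.2
```

An R2 score close to 0 means the predictions explain almost none of the variance in the reference values, which matches the reading that the initial corrections track the hand corrections poorly.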

As there were both perfectly corrected and poorly corrected texts, I decided to analyze this further by checking each data point and identifying why the correction did not perform well. This revealed the following issues with the Spelling Correction module, listed from most common to least common:

Not fixing due to not knowing the grammar rules (de, ki, mi, her şey, hiçbir...)
Not fixing due to not knowing the word (such as slang, capital letters, names) + small vocabulary issues
Not fixing due to the misspelled word being meaningful on its own (bir --> biri)
Not fixing words with missing diacritics (İdare -> Idare, gun, cok). This is especially common if a word has more than one spelling issue
Not fixing due to wrong punctuation (gibi.deli, 3.6 --> 36)
Not fixing due to a gap between two meaningful words (gök yüzü)
Making wrong translations
Some tweets had their hashtags removed, so this affected the results as well
Not fixing repetitions
Not fixing words with a Turkish-English mix (tweetlerine)
Not fixing reduplications (ikilemeler)
Not fixing due to emoji issues

Some of these errors cannot be fixed easily, but to address some of them I lowercased all words and removed punctuation, hashtags, links, repetitions (using pull request #277) and emojis. The results after these changes are shown below:

Tweet Type	Levenshtein Score
Original - Hand Corrected	4.73
Original - Code Corrected	2.68
Code Corrected - Hand Corrected	5.64
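
The cleanup steps described above (lowercasing and removing punctuation, hashtags, links, repetitions and emojis) could look roughly like the sketch below. It is an assumption-laden illustration, not the code used for these results: `clean_tweet` is a hypothetical name, the whole hashtag token is dropped, a "repetition" is taken to mean three or more identical characters in a row (the actual logic lives in pull request #277), and the emoji ranges are not exhaustive.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji ranges only; an assumption, not a complete emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # same character three or more times


def clean_tweet(text: str) -> str:
    text = URL_RE.sub(" ", text)       # links first, before punctuation is stripped
    text = HASHTAG_RE.sub(" ", text)   # drop the whole hashtag token (assumption)
    text = EMOJI_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1", text)  # "cooook" -> "cok"
    # str.lower() is not Turkish-aware: map dotted and dotless I explicitly.
    text = text.replace("İ", "i").replace("I", "ı").lower()
    return " ".join(text.split())      # normalise whitespace


print(clean_tweet("COOOK iyi!!! #tatil https://t.co/abc \U0001F600"))
```

Replacing punctuation with a space rather than deleting it keeps cases like gibi.deli as two words, at the cost of splitting numbers such as 3.6.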

As the results show, the overall Levenshtein scores decrease, and the gap between the mean Original - Hand Corrected and Original - Code Corrected scores also narrows, from 3.59 (6.82 - 3.23) to 2.05 (4.73 - 2.68).

Similarly, the mean distance between code-corrected and hand-corrected tweets decreases from 7.98 to 5.64. This suggests that, compared with the initial run, the code-corrected tweets have moved closer to the hand-corrected tweets (considered the true value), although this distance is still larger than the Original - Hand Corrected distance of 4.73. Other results after these changes are:

R2 score: 0.2596 MSE: 24.95

GlobalMaksimum / sadedegel

Spelling Correction Test Results #283