GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Spelling Correction Test Results #283

Open irmakyucel opened 3 years ago

irmakyucel commented 3 years ago

I continued testing and analyzing the Spelling Correction module, as described in pull request #213. There was no dataset for measuring this module's performance, so 100 tweets containing spelling mistakes were collected and corrected by hand. To measure the similarity between texts, a Levenshtein score function was implemented. Right now the Spelling Correction module is used and scored as follows:

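d = Doc(text)                         # text: the raw tweet string
correct_d = corrector.correct_doc(d)  # run the spelling corrector on the Doc
levenshtein_dist(d, correct_d)        # edit distance between original and corrected text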

An example of this is shown below:

"kilgi verir misin" --> "bilgi verir misin" with Levenshtein score = 1
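For readers who want to reproduce the score, here is a minimal sketch of a character-level Levenshtein distance in plain Python. The function name `levenshtein` is hypothetical; it is not the `levenshtein_dist` helper used in the issue, whose implementation is not shown, and it works on plain strings rather than `Doc` objects.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# The example from the issue: one substitution (k -> b).
print(levenshtein("kilgi verir misin", "bilgi verir misin"))  # 1
```

Only two rows of the dynamic-programming table are kept, so memory stays proportional to the shorter text.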

Initial error analysis is done using the mean of this metric (Mean Levenshtein Score). The results of the initial analysis are:

| Tweet Type | Mean Levenshtein Score |
| --- | --- |
| Original - Hand Corrected | 6.82 |
| Original - Code Corrected | 3.23 |

This shows that the spelling corrector changes the original tweets less than hand correction does (3.23 vs. 6.82 edits on average). A second comparison looks at this from a different angle: the code-corrected tweets are on average farther from the hand-corrected tweets than the originals are, so the corrector is not moving tweets closer to the hand-corrected versions but away from them.

| Tweet Type | Mean Levenshtein Score |
| --- | --- |
| Original - Hand Corrected | 6.82 |
| Code Corrected - Hand Corrected | 7.98 |
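The three pairwise means used in these tables can be computed along the following lines. This is only a sketch: the helper names and the two toy tweets are invented for illustration and are not taken from the 100-tweet set, and the corrector output is hard-coded rather than produced by the actual module.

```python
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def mean_distance(left, right):
    """Mean edit distance over aligned pairs of texts."""
    return mean(levenshtein(x, y) for x, y in zip(left, right))


# Toy data, invented for illustration only.
original = ["kilgi verir misin", "nasilsn"]
hand_corrected = ["bilgi verir misin", "nasılsın"]
code_corrected = ["bilgi verir misin", "nasilsn"]  # pretend corrector output

report = {
    "Original - Hand Corrected": mean_distance(original, hand_corrected),
    "Original - Code Corrected": mean_distance(original, code_corrected),
    "Code Corrected - Hand Corrected": mean_distance(code_corrected, hand_corrected),
}
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

A corrector that helps should make the last row smaller than the first; in the results reported above it is larger (7.98 vs. 6.82).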

Other results of this initial analysis:

best_max_edit_distance = 0
worst_max_edit_distance = 24

MSE: 42.23
R2 score: 0.0093
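The issue does not say which two series the MSE and R2 score were computed over. One plausible reading, stated here purely as an assumption, is that each tweet contributes a "true" value and a "predicted" value (for example, per-tweet distances against the hand-corrected and code-corrected versions). Under that assumption the two metrics reduce to the following plain-Python sketch; the numbers are toy values, not the data from this analysis.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy per-tweet values (hypothetical, for illustration only).
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 2, 5]
print(mse(y_true, y_pred), r2_score(y_true, y_pred))
```

An R2 score near 0, as reported above, means the predicted values explain almost none of the variance in the true values.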

Since some texts were corrected perfectly and others poorly, I analyzed the problem further by checking each data point and identifying why it performed badly. This revealed the following issues with the Spelling Correction module, listed from most common to least common:

Some of these errors cannot be fixed easily, but to address some of them I lowercased all words and removed punctuation, hashtags, links, repetitions (using pull request #277) and emojis. The results after these changes are shown below:
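A rough sketch of such a cleaning step is shown below. It is not the code from this analysis or from pull request #277: the function names, the regular expressions, the choice to drop the whole hashtag token, the rule of collapsing runs of three or more identical characters to one, and the emoji code point ranges are all assumptions made for illustration. Turkish dotted and dotless I are handled explicitly because Python's default `str.lower()` does not follow Turkish casing rules.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")       # assumption: drop the whole hashtag token
REPEAT_RE = re.compile(r"(.)\1{2,}")   # assumption: 3+ repeats collapse to one
# Assumption: a coarse emoji range, not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "“”‘’…")


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules for I/ı and İ/i."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def preprocess(tweet: str) -> str:
    text = URL_RE.sub(" ", tweet)          # remove links
    text = HASHTAG_RE.sub(" ", text)       # remove hashtags
    text = EMOJI_RE.sub(" ", text)         # remove emojis
    text = turkish_lower(text)             # lowercase
    text = text.translate(PUNCT_TABLE)     # remove punctuation
    text = REPEAT_RE.sub(r"\1", text)      # collapse character repetitions
    return " ".join(text.split())          # normalize whitespace


print(preprocess("Çoook GÜZEL!!! #tatil https://t.co/abc 😀"))
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being split into leftover fragments.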

| Tweet Type | Mean Levenshtein Score |
| --- | --- |
| Original - Hand Corrected | 4.73 |
| Original - Code Corrected | 2.68 |
| Code Corrected - Hand Corrected | 5.64 |

As the results show, all mean Levenshtein scores decrease, and the gap between the Original - Hand Corrected and Original - Code Corrected means also decreases, from 3.59 (6.82 - 3.23) to 2.05 (4.73 - 2.68).

Similarly, the distance between code-corrected and hand-corrected tweets decreases from 7.98 to 5.64. This suggests that, after these changes, the code-corrected tweets are closer to the hand-corrected tweets (treated as the true value) than before, although at 5.64 they are still farther from them than the cleaned originals are (4.73). Other results after these changes are:
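The arithmetic behind this comparison can be rechecked directly from the reported means. The dictionary keys and the `gaps` helper below are hypothetical names; the numbers are the ones given in the tables of this issue.

```python
# Mean Levenshtein scores reported before and after the cleaning step.
before = {"orig_hand": 6.82, "orig_code": 3.23, "code_hand": 7.98}
after = {"orig_hand": 4.73, "orig_code": 2.68, "code_hand": 5.64}


def gaps(means):
    """Differences between the reported means, rounded to two decimals."""
    return {
        # how many more edits hand correction makes than the code does
        "hand_vs_code_edits": round(means["orig_hand"] - means["orig_code"], 2),
        # positive: code output is farther from hand correction than the original is
        "code_hand_vs_orig_hand": round(means["code_hand"] - means["orig_hand"], 2),
    }


print("before:", gaps(before))
print("after: ", gaps(after))
```

Both gaps shrink after cleaning, but the second one stays positive, so on average the corrected text has not yet become closer to the hand-corrected text than the uncorrected input.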

R2 score: 0.2596
MSE: 24.95

askarbozcan commented 2 years ago

Note to self: Either remove spelling correction entirely or revamp it completely with techniques that work better.