cbaziotis / ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
MIT License

spelling correction mostly is not working #20

Open stas00 opened 4 years ago

stas00 commented 4 years ago

I came to this project for spelling correction in Twitter text, but it doesn't work most of the time.

  1. Spell correction seems to work only when annotate is set as in the example. Take the same example, set annotate={}, and spell correction is gone:

    i saw the new john doe movie and it suuuuucks ! ! ! waisted <money> . . . bad movies <annoyed>

    If I restore annotate={"hashtag", "...}, then it corrects suuuuucks to sucks. I'm not sure what the connection is between annotations and spell correction.

  2. Spelling correction doesn't work in general. Going back to your pipeline example again, change the first input sentence to inject some spelling errors: CANT WAIT for the neww seaason of #TwinPeaks. Run it, and you get: cant wait for the neww seaason of twin peaks, i.e. no spell correction. Setting spell_correct_elong doesn't seem to make a difference.

Yet, if I run:

from ekphrasis.classes.spellcorrect import SpellCorrector
sp = SpellCorrector(corpus="english") 
print([sp.correct(x) for x in "neww seaason".split()])

It corrects: ['new', 'season']
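Since the standalone corrector does fix these words, one possible stopgap is to run a per-word correction function over the tokens the pipeline returns. The snippet below is only a rough sketch of that idea, not how ekphrasis works internally. The names correct_tokens, toy_correct and TOY_FIXES are hypothetical, and the toy dictionary stands in for sp.correct so the snippet runs without ekphrasis installed.

```python
from typing import Callable, Iterable, List


def correct_tokens(tokens: Iterable[str], correct: Callable[[str], str]) -> List[str]:
    """Apply `correct` to purely alphabetic tokens and pass everything else through.

    Annotation tags such as <money> and punctuation such as ! are not
    alphabetic, so they are returned unchanged.
    """
    return [correct(tok) if tok.isalpha() else tok for tok in tokens]


# Hypothetical stand-in for SpellCorrector(corpus="english").correct,
# used here only so the sketch is self-contained.
TOY_FIXES = {"neww": "new", "seaason": "season"}


def toy_correct(word: str) -> str:
    return TOY_FIXES.get(word, word)


# Pipeline output from the report above, already split into tokens.
tokens = "cant wait for the neww seaason of twin peaks <money> !".split()
print(" ".join(correct_tokens(tokens, toy_correct)))
# prints: cant wait for the new season of twin peaks <money> !
```

With ekphrasis installed, passing sp.correct from the snippet above in place of toy_correct should have the same effect on neww and seaason, though I have not checked how it behaves on other tokens.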