Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
I came to this project for spell correction of Twitter text, but it doesn't work most of the time.
Spell correction seems to work only when annotate is set as in the example. Take the same
example, set annotate={}, and spell correction is gone:
i saw the new john doe movie and it suuuuucks ! ! ! waisted <money> . . . bad movies <annoyed>
If I restore annotate={"hashtag", "...}, it corrects suuuuucks to sucks.
I'm not sure what the connection between annotations and spell correction is.
Spell correction also doesn't work in general. Going back to your pipeline example, change the first input sentence to inject some spelling errors: CANT WAIT for the neww seaason of #TwinPeaks. Run it and you get:
cant wait for the neww seaason of twin peaks
i.e. no spell correction.
The spell_correct_elong option doesn't seem to make a difference.
Yet if I run:
from ekphrasis.classes.spellcorrect import SpellCorrector
sp = SpellCorrector(corpus="english")
print([sp.correct(x) for x in "neww seaason".split()])
It corrects:
['new', 'season']
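Since the standalone corrector works word by word, one possible workaround is to apply it to the pipeline's tokens afterwards. The sketch below is only an illustration of that idea, not how ekphrasis wires spell correction internally: correct_tokens is a hypothetical helper, and the rule for what to skip (tags such as <money>, and tokens without letters) is an assumption.

```python
import re
from typing import Callable, Iterable, List

# Tokens that pass through untouched (assumed rule, not taken from ekphrasis):
# annotation tags such as <money>, and tokens with no ASCII letters
# (punctuation, numbers).
_TAG = re.compile(r"^<[^<>]+>$")
_HAS_LETTER = re.compile(r"[A-Za-z]")


def correct_tokens(tokens: Iterable[str], correct: Callable[[str], str]) -> List[str]:
    """Apply a word-level corrector to each plain word token.

    `correct` is any callable mapping one word to its correction,
    for example the `correct` method of a spell corrector object.
    """
    out = []
    for tok in tokens:
        if _TAG.match(tok) or not _HAS_LETTER.search(tok):
            out.append(tok)           # leave tags and punctuation alone
        else:
            out.append(correct(tok))  # correct ordinary words
    return out
```

With the library, correct would be sp.correct from the snippet above, applied to the pipeline output once it has been split into tokens. This does not explain why the pipeline skips correction when annotate={}; it only sidesteps it.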