Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
I understand that, following Peter Norvig's approach to spelling correction, it should be relatively easy to make the corrector work for any given language, provided a large (and reliable) corpus is available.
I know that you can set the corpus to either "english" or "twitter", and that the function ekphrasis.utils.read_stats() will load the corresponding corpus file. What I don't know is where I have to store a new corpus file so that it is used for spelling correction, and what to pass as the corrector argument when instantiating a TextPreProcessor, for instance.
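To make concrete what I mean by Norvig's approach being language-independent, here is a minimal sketch of that style of corrector using only word counts. It is not ekphrasis's actual implementation: the names build_counts, edits1 and correct, the toy Spanish sample text, and the simplification to edit distance 1 are all my own assumptions for illustration.

```python
import re
from collections import Counter


def build_counts(text):
    """Count word frequencies in a raw text corpus (any language)."""
    return Counter(re.findall(r"\w+", text.lower()))


def edits1(word, alphabet):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts, alphabet):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word, alphabet) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)


# Toy corpus standing in for a large, reliable corpus in the target language.
sample = "el gato come pescado y el perro come carne el gato duerme"
counts = build_counts(sample)
# Derive the alphabet from the corpus itself so nothing is English-specific.
alphabet = "".join(sorted(set("".join(counts))))

print(correct("gto", counts, alphabet))
print(correct("perrro", counts, alphabet))
```

Nothing in this sketch depends on English except the corpus and the alphabet derived from it, which is why I expect that swapping in word statistics for another language should be enough.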