Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
The Segmentation tool you provide is excellent. One feature request:
Unless I am mistaken, the tool (1) always returns the split words in lower case, and (2) does not provide information about where spaces were inserted. A preserve_case or capitalize parameter would be helpful (for 1). The following code capitalizes the split string according to the capitalization used in the hashtag.
from ekphrasis.classes.segmenter import Segmenter

segmenter = Segmenter(corpus="twitter")


def word_segmentation(text, fix_case=True):
    words_string = segmenter.segment(text)
    if not fix_case:
        return words_string
    fixed = ""
    n_add = 0  # number of spaces the segmenter has inserted so far
    for i in range(len(words_string)):
        # words_string[i] lines up with text[i - n_add]
        if words_string[i] == " " and text[i - n_add] != " ":
            # this space is not in the original text, so it was inserted
            n_add += 1
            fixed += " "
            continue
        is_capital = text[i - n_add].isupper()
        if is_capital:
            fixed += words_string[i].upper()
        else:
            fixed += words_string[i]
    return fixed
Of course, if the user is using camelCase or PascalCase, the capitalization may not be meaningful, but in other cases it can be. For instance:
I #eatsomuch food --> I eat so much food.
I care so much. #IranProtests --> I care so much. Iran Protests
Arguably, a stand-alone hashtag functions roughly as a proper noun, in which case the capitalization the author chose is meaningful.
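For point 2, here is a rough sketch of how the case fix and the split positions could be returned together. It is only an illustration, not ekphrasis code: `align_case`, `expand_hashtags` and `demo_segment` are hypothetical names, and `demo_segment` is a lookup table standing in for `segmenter.segment` so the snippet runs without the library. It assumes the segmented string is the original lowercased with extra spaces inserted and nothing else changed.

```python
import re
from typing import Callable, List, Tuple


def align_case(original: str, segmented: str) -> Tuple[str, List[int]]:
    """Copy the casing of `original` onto `segmented`.

    Returns the re-cased string and the offsets in `original` before which
    a space was inserted. Assumption: `segmented` differs from `original`
    only by case and by inserted spaces.
    """
    out = []
    split_offsets = []
    j = 0  # current position in `original`
    for ch in segmented:
        if ch == " " and (j >= len(original) or original[j] != " "):
            # Space added by the segmenter: remember where it falls.
            split_offsets.append(j)
            out.append(" ")
            continue
        if j >= len(original) or original[j].lower() != ch.lower():
            raise ValueError("segmented text does not align with original")
        out.append(original[j])  # keep the character as originally typed
        j += 1
    if j != len(original):
        raise ValueError("segmented text is shorter than original")
    return "".join(out), split_offsets


HASHTAG = re.compile(r"#(\w+)")


def expand_hashtags(tweet: str, segment: Callable[[str], str]) -> str:
    """Replace each #hashtag in `tweet` by its case-preserving segmentation."""
    def repl(match):
        tag = match.group(1)
        return align_case(tag, segment(tag))[0]
    return HASHTAG.sub(repl, tweet)


# Hypothetical stand-in for segmenter.segment, covering the two examples only.
_DEMO_SPLITS = {"eatsomuch": "eat so much", "IranProtests": "iran protests"}


def demo_segment(word: str) -> str:
    return _DEMO_SPLITS.get(word, word.lower())


print(align_case("IranProtests", "iran protests"))  # ('Iran Protests', [4])
print(expand_hashtags("I care so much. #IranProtests", demo_segment))
```

With the real library, `segmenter.segment` from the snippet above would be passed in place of `demo_segment`. Raising on misaligned input is a design choice here, so that any normalization beyond lowercasing and splitting is noticed rather than silently mis-cased.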