cbaziotis / ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia, Twitter - 330mil English tweets).
MIT License

Segmentation: Preserve case? #19

Open davidbernat opened 4 years ago

davidbernat commented 4 years ago

The Segmentation tool you provide is excellent. One feature request:

Unless I am mistaken, the tool always (1) returns the split words in lower case, and (2) gives no information about where spaces were inserted. A preserve_case or capitalize parameter would be helpful for (1). The following code re-capitalizes the split string according to the capitalization used in the original hashtag.


from ekphrasis.classes.segmenter import Segmenter
segmenter = Segmenter(corpus="twitter")

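def word_segmentation(text, fix_case=True):
    words_string = segmenter.segment(text)
    if not fix_case:
        return words_string

    fixed = ""

    # n_add counts the spaces the segmenter has inserted so far, so
    # words_string[i] corresponds to text[i - n_add].
    n_add = 0
    for i in range(len(words_string)):
        # A space here with no space at the matching position in the
        # original text was inserted by the segmenter.
        if words_string[i] == " " and text[i-n_add] != " ":
            n_add += 1
            fixed += " "
            continue

        # Copy the original capitalization onto the segmented character.
        is_capital = text[i-n_add].isupper()
        if is_capital:
            fixed += words_string[i].upper()
        else:
            fixed += words_string[i]
    return fixed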

Of course, if the user is writing in camelCase or PascalCase, the capitalization may not be meaningful, but in other cases it can be. For instance:

I #eatsomuch food --> I eat so much food
I care so much. #IranProtests --> I care so much. Iran Protests

Arguably, a stand-alone hashtag functions roughly like a proper noun, in which case the capitalization the author chose is meaningful.
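
For request (2), the positions of the inserted spaces could be collected in the same pass. Below is a minimal sketch that does not call ekphrasis at all: it takes the original text and the already segmented string as plain arguments, and it assumes the segmented string is just the original, lower-cased, with extra spaces added. The names restore_case and inserted_at are hypothetical and not part of the ekphrasis API.

```python
def restore_case(original, segmented):
    """Re-apply the casing of `original` to `segmented`.

    Assumption: `segmented` equals `original` lower-cased, with extra
    spaces inserted between words and nothing else changed.

    Returns (fixed, inserted_at), where inserted_at lists the indices in
    `original` before which a space was inserted.
    """
    fixed = []
    inserted_at = []
    j = 0  # current position in `original`
    for ch in segmented:
        # A space with no matching space in the original was inserted.
        if ch == " " and (j >= len(original) or original[j] != " "):
            inserted_at.append(j)
            fixed.append(" ")
            continue
        # Otherwise take the character, with its case, from the original.
        fixed.append(original[j])
        j += 1
    return "".join(fixed), inserted_at


fixed, inserted_at = restore_case("IranProtests", "iran protests")
print(fixed, inserted_at)
```

Returning the indices alongside the string lets a caller map each segmented word back to its span in the original hashtag without comparing the two strings again.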