cbaziotis / ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
MIT License

The TextPreProcessor class only supports segmenting text with hashtags. Support for a normal text segmenter is required. #15

Closed aman5319 closed 5 years ago

aman5319 commented 5 years ago

The TextPreProcessor class only performs word segmentation when the hashtag symbol is present; otherwise the concatenated words are left unsegmented.

Example:

# With hashtag it works
s = " question kind infidelity passed sweety not feel sweet #savingyourmarriagebeforeitstarts"
print(" ".join(text_processor.pre_process_doc(s)))
'question kind infidelity passed sweety not feel sweet <hashtag> saving your marriage before it starts </hashtag>'

# Without the hashtag it fails
s = " question kind infidelity passed sweety not feel sweet savingyourmarriagebeforeitstarts"
print(" ".join(text_processor.pre_process_doc(s)))
" question kind infidelity passed sweety not feel sweet savingyourmarriagebeforeitstarts"

The TextPreProcessor configuration used here is similar to the one defined in the README.md file.

Kindly review this, and if you agree that the behaviour should change, I can send a pull request.
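
To make the request concrete, here is a minimal, self-contained sketch of the behaviour being asked for: segmenting run-together words in plain tokens, not only in hashtags. It is not ekphrasis code. The names `TOY_COUNTS`, `segment` and `segment_unknown_tokens` are hypothetical, and the tiny hand-written word counts are an assumption that stands in for real corpus statistics. The splitting itself is a generic unigram-probability segmentation; I am not claiming it matches how the library segments hashtags internally.

```python
import math
from functools import lru_cache

# Hypothetical toy unigram counts (assumption); a real segmenter would load
# word statistics from a large corpus instead.
TOY_COUNTS = {
    "saving": 50, "your": 500, "marriage": 40, "before": 200,
    "it": 900, "starts": 60, "be": 1000, "for": 800,
}
TOTAL = sum(TOY_COUNTS.values())


def word_logprob(word):
    """Log probability of a single word under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is None:
        # Unknown strings get a penalty that grows with their length.
        return math.log(10.0 / TOTAL) - len(word) * math.log(10.0)
    return math.log(count / TOTAL)


@lru_cache(maxsize=None)
def segment(text):
    """Return (log probability, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidates.append((word_logprob(head) + tail_score, (head,) + tail_words))
    return max(candidates)


def segment_unknown_tokens(tokens):
    """Split a token only when every resulting piece is a known word."""
    result = []
    for token in tokens:
        if token in TOY_COUNTS or not token.isalpha():
            result.append(token)
            continue
        pieces = segment(token.lower())[1]
        if len(pieces) > 1 and all(p in TOY_COUNTS for p in pieces):
            result.extend(pieces)
        else:
            result.append(token)  # leave ordinary or unknown words untouched
    return result


s = "question kind infidelity passed sweety not feel sweet savingyourmarriagebeforeitstarts"
print(" ".join(segment_unknown_tokens(s.split())))
# question kind infidelity passed sweety not feel sweet saving your marriage before it starts
```

The guard in `segment_unknown_tokens` is a deliberate design choice: a token is replaced only when the whole split consists of known words, so ordinary words missing from the vocabulary are not chopped into fragments. Inside the library, a similar step would presumably run on tokens after tokenization, using the existing word statistics rather than a toy dictionary.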