chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Clean emoji in preprocess_text method #238

Closed lord-alfred closed 5 years ago

lord-alfred commented 5 years ago

context

Texts sometimes contain emojis. I think many people who use textacy do not need them in the preprocessing results.

proposed solution

The preprocess_text method needs a new argument, such as no_emoji.

Some useful links:
https://www.regextester.com/106421 (matches some special symbols such as (c) / (r), etc.)
https://unicode.org/Public/emoji/12.0/emoji-test.txt
https://unicode.org/Public/emoji/12.0/
http://www.unicode.org/emoji/charts/full-emoji-list.html
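
To make the request concrete, here is a minimal sketch of what a no_emoji option could do. It is not textacy's implementation: the names EMOJI_PATTERN, strip_emoji, and preprocess_sketch are hypothetical, and the code point ranges are a rough, non-exhaustive approximation rather than the full list in the Unicode emoji data files linked above.

```python
import re

# Assumption: a rough, non-exhaustive set of emoji code point ranges.
# The authoritative list lives in the Unicode emoji data files.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters (used for flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero width joiner (used in emoji sequences)
    "]+"
)


def strip_emoji(text: str, replace_with: str = "") -> str:
    """Replace each run of matched emoji characters with `replace_with`."""
    return EMOJI_PATTERN.sub(replace_with, text)


def preprocess_sketch(text: str, no_emoji: bool = False) -> str:
    """Hypothetical stand-in for a preprocessing function with a no_emoji flag."""
    if no_emoji:
        text = strip_emoji(text)
    return text
```

Calling preprocess_sketch(text, no_emoji=True) removes the matched characters and leaves the rest of the string untouched, including any whitespace that surrounded the emoji.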

bdewilde commented 5 years ago

Hi @lord-alfred, related functionality is already implemented in spacymoji. You could filter out emoji tokens in post-processing with [tok for tok in doc if not tok._.is_emoji]. If this doesn't meet your use case, please let me know.
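
For readers who want to try the suggestion, the filter above could be wired up roughly as follows. This is a sketch under assumptions: it presumes spaCy v3 and spacymoji are installed and that spacymoji registers its component under the factory name "emoji". The helper names drop_emoji_tokens, build_pipeline, and demo are illustrative and do not come from either library.

```python
def drop_emoji_tokens(doc):
    """Return the tokens of a spaCy Doc that spacymoji did not flag as emoji."""
    return [tok for tok in doc if not tok._.is_emoji]


def build_pipeline():
    """Build a blank English pipeline with the spacymoji component added."""
    # Imported lazily so drop_emoji_tokens can be used without these packages.
    import spacy
    from spacymoji import Emoji  # noqa: F401  (assumed to register "emoji")

    nlp = spacy.blank("en")
    # Assumption: spaCy v3 style, where components are added by factory name.
    nlp.add_pipe("emoji", first=True)
    return nlp


def demo():
    """Print the non-emoji tokens of a short example sentence."""
    nlp = build_pipeline()
    doc = nlp("This is a test \U0001F63B")
    print([tok.text for tok in drop_emoji_tokens(doc)])
```

Running demo() prints the token texts that remain after filtering. Note that this approach works on tokens after the pipeline has run, so it does not change the raw text the way a preprocess_text argument would.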