Closed lord-alfred closed 5 years ago
Texts sometimes contain emojis, and I think many people who use textacy don't need them in the preprocessing results.
The preprocess_text method needs a new argument like no_emoji.
Some useful links:
https://www.regextester.com/106421 (matches some special symbols such as (c) and (r))
https://unicode.org/Public/emoji/12.0/emoji-test.txt
https://unicode.org/Public/emoji/12.0/
http://www.unicode.org/emoji/charts/full-emoji-list.html
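To make the request concrete, here is a minimal sketch of what a no_emoji-style option might do under the hood. The remove_emoji function, its replace_with parameter, and the chosen code point ranges are all assumptions made for illustration; none of this is textacy's actual API, and the ranges only approximate the Unicode emoji data linked above.

```python
import re

# Hypothetical helper, NOT part of textacy: a rough regex-based emoji stripper.
# The ranges below cover common emoji blocks only; they approximate the Unicode
# emoji data files and will miss some emoji and catch some non-emoji symbols.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001F5FF"  # misc symbols and pictographs (incl. skin tones)
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001FA70-\U0001FAFF"  # symbols and pictographs extended-A
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero width joiner (also used in some scripts)
    "]+"
)


def remove_emoji(text: str, replace_with: str = "") -> str:
    """Replace each run of matched emoji characters with `replace_with`."""
    return EMOJI_PATTERN.sub(replace_with, text)


if __name__ == "__main__":
    print(remove_emoji("I love textacy 😻👍🏿!"))
```

Because the pattern also strips the zero width joiner, it can alter text in scripts that use that character legitimately, so a production version would more likely be generated from emoji-test.txt than written by hand.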
Hi @lord-alfred, related functionality is already implemented in spacymoji. You could filter out emoji tokens in post-processing, e.g. [tok for tok in doc if not tok._.is_emoji]. If this doesn't meet your use case, please let me know.