davidmogar / cucco

Text normalization library for Python
MIT License
203 stars 27 forks

Emojis are not removed #47

Open WoaDmulL opened 6 years ago

WoaDmulL commented 6 years ago

Newer emojis such as 🤗, 🥂, 🤔, and 🤘 are not removed.

Check this gist https://gist.github.com/octohedron/3823d081eb1b92abe93b570875ec77f4
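
One plausible cause, sketched here as an assumption rather than a diagnosis of cucco's actual regex: a character class that lists only the older emoji blocks will miss these four emojis, which all live in the Supplemental Symbols and Pictographs block (U+1F900 to U+1F9FF). The pattern names `OLD_BLOCKS` and `EXTENDED` below are hypothetical.

```python
import re

# Hypothetical pattern covering only older emoji blocks.
# This is NOT cucco's actual regex, just an illustration.
OLD_BLOCKS = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "]+"
)

# Same pattern extended with the block that holds 🤗, 🥂, 🤔 and 🤘.
EXTENDED = re.compile(
    "["
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "]+"
)

text = "Thinking 🤔 and smiling 😄"
old_result = OLD_BLOCKS.sub("", text)  # 😄 is removed, 🤔 survives
new_result = EXTENDED.sub("", text)    # both are removed
print(repr(old_result))
print(repr(new_result))
```

Enumerating blocks by hand tends to fall behind each Unicode release, which is one argument for relying on a maintained emoji list instead.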

davidmogar commented 6 years ago

Thank you for the report. I'll try to fix this as soon as possible.

davidmogar commented 6 years ago

I'm considering removing the current regex and using this library instead, since it would allow replacing emojis with text. That was the initial goal. What do you think?

WoaDmulL commented 6 years ago

I think it would be better to remove them by default, and to replace them with text only when a custom setting or optional argument is passed to the function. Or the other way around.

davidmogar commented 6 years ago

Yeah, that is the current behavior. The problem is that you lose information in the process. Consider the following scenario:

In both cases, the current code would return "Today I'm", which completely removes the useful information.

Just something to think about...

WoaDmulL commented 6 years ago

We could cover all cases:

normalize("Today I'm 😄") => Today I'm 😄
strip_emoji("Today I'm 😄") => Today I'm
describe_emoji("Today I'm 😄") => Today I'm happy
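
A minimal sketch of how the last two functions might look, under stated assumptions: the `EMOJI` pattern and the `DESCRIPTIONS` table are hypothetical stand-ins, not cucco's code or any library's data, and `strip_emoji` and `describe_emoji` are only the names proposed above. `normalize` is omitted because in this proposal it would leave emojis untouched.

```python
import re

# Hypothetical emoji pattern covering a few common blocks (an assumption,
# not an exhaustive list and not cucco's regex).
EMOJI = re.compile(
    "["
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\U0001F900-\U0001F9FF"
    "]"
)

# Hypothetical description table; a real implementation would need a
# maintained source for these labels.
DESCRIPTIONS = {"😄": "happy", "😢": "sad"}


def _tidy(text):
    """Collapse runs of whitespace left behind by removed emojis."""
    return re.sub(r"\s+", " ", text).strip()


def strip_emoji(text):
    """Remove every character matched by EMOJI."""
    return _tidy(EMOJI.sub("", text))


def describe_emoji(text):
    """Replace known emojis with a word; drop the ones without a description."""
    return _tidy(EMOJI.sub(lambda m: DESCRIPTIONS.get(m.group(0), ""), text))


print(strip_emoji("Today I'm 😄"))
print(describe_emoji("Today I'm 😄"))
```

Keeping removal and description as separate functions lets callers choose whether to lose or keep the information an emoji carries, which is the trade-off raised earlier in the thread.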