Open marianelamin opened 4 years ago
@marianelamin We should probably decide how we are going to treat emojis. Given the byte pair encoding, I'm not sure it makes sense to include them.
Also, I would remove URLs.
Finally, I'm not 100% sure if we should use stemming. CC @Edilmo
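
For reference, a minimal sketch of what dropping URLs and emojis could look like. The helper names `remove_urls` and `remove_emojis` and both regex patterns are assumptions for illustration, not existing project code. The emoji pattern only covers the main Unicode blocks and would probably need extending for flags and ZWJ sequences.

```python
import re

# Hypothetical patterns (assumptions, not project code).
# Matches anything starting with http://, https:// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

# Rough emoji coverage: main pictograph blocks plus misc symbols and dingbats.
# Flags (regional indicators) and ZWJ sequences are NOT fully handled here.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)


def remove_urls(text: str) -> str:
    """Delete URL-like tokens; surrounding whitespace is left untouched."""
    return URL_PATTERN.sub("", text)


def remove_emojis(text: str) -> str:
    """Delete characters that fall inside the emoji ranges above."""
    return EMOJI_PATTERN.sub("", text)
```

If we decide to keep emojis instead, `remove_emojis` would simply be left out of the pipeline.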
When removing punctuation, our current method only removes `.`, `-`, `:`, `,` and `?`. Should we also remove `!` and `"` or `'`?
@Edilmo @dieko95
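
To make the question concrete, here is a small sketch where the set of marks is a parameter, so the extra ones can be switched on once we agree. The function name `remove_punctuation` and the two constants are hypothetical; this is not the current method's actual code.

```python
# Marks the current method is described as removing.
CURRENT_PUNCTUATION = ".-:,?"
# Candidates from the question above; whether to include them is still open.
EXTRA_PUNCTUATION = "!\"'"


def remove_punctuation(text: str, marks: str = CURRENT_PUNCTUATION) -> str:
    """Delete every character listed in `marks` from `text`."""
    # str.maketrans with a third argument builds a table that deletes those characters.
    return text.translate(str.maketrans("", "", marks))
```

Calling `remove_punctuation(text, CURRENT_PUNCTUATION + EXTRA_PUNCTUATION)` would apply the wider set without changing the default behavior.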
[x] Create a set of data cleaning methods:
- Replace accented characters: `á é í ó ú` -> `aeiou` and `ñ` -> `gn`
- Remove punctuation: `.` `-` `:` `,` `?`
[x] Create the `Cleaning` class. The idea is that each method above belongs to the `Cleaning` class. This can be part of the c4v nlp cleaning library.
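
A rough sketch of how the two methods in the checklist could sit together in one class. The method names, the `clean` helper and the order of the steps are assumptions for illustration; the real `Cleaning` class may be organized differently. Only the lowercase characters listed above are mapped.

```python
class Cleaning:
    """Illustrative grouping of the cleaning methods (hypothetical layout)."""

    # Mappings taken from the checklist: accented vowels lose the accent, ñ becomes gn.
    ACCENT_TABLE = str.maketrans(
        {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ñ": "gn"}
    )
    # Punctuation marks listed in the checklist.
    PUNCTUATION_TABLE = str.maketrans("", "", ".-:,?")

    @classmethod
    def replace_accents(cls, text: str) -> str:
        """Apply the accent and ñ replacements (lowercase only)."""
        return text.translate(cls.ACCENT_TABLE)

    @classmethod
    def remove_punctuation(cls, text: str) -> str:
        """Delete the listed punctuation marks."""
        return text.translate(cls.PUNCTUATION_TABLE)

    @classmethod
    def clean(cls, text: str) -> str:
        """Run both steps in sequence; the order here is an assumption."""
        return cls.remove_punctuation(cls.replace_accents(text))
```

Keeping each step as its own method means a caller can run a single step or the whole `clean` chain.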