code-for-venezuela / c4v-py


Cleaning data before BPE #35

Open marianelamin opened 4 years ago

marianelamin commented 4 years ago
dieko95 commented 4 years ago

@marianelamin We should probably decide how we are going to treat emojis. I'm not sure it makes sense to include them in the byte pair encoding.

Also, I would remove URLs.

Finally, I'm not 100% sure whether we should use stemming. CC @Edilmo
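For the URL and emoji part, something like the following might work as a starting point. It is only a rough sketch: the function name `strip_urls_and_emojis` is hypothetical (not something in c4v-py), the URL pattern is a simple heuristic, and the emoji ranges are an approximation of the main Unicode emoji blocks, not an exhaustive list.

```python
import re

# Heuristic URL pattern: anything starting with http(s):// or www.
# up to the next whitespace. This is an assumption, not a full URL parser.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Approximate emoji ranges (assumption: covers the common blocks only).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flag pairs)
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def strip_urls_and_emojis(text):
    """Replace URLs and emojis with a space, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_urls_and_emojis("No hay agua 😡 en Caracas https://t.co/abc123")
print(cleaned)
```

Replacing with a space instead of deleting outright avoids gluing neighboring words together when an emoji or URL sits between them with no whitespace.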

marianelamin commented 4 years ago

When removing punctuation, our current method only removes `.`, `-`, `:`, `,`, and `?`. Should we also remove `!`, `"`, and `'`? @Edilmo @dieko95
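To make the comparison concrete, here is a minimal sketch of the two options. It is not the project's actual cleaning code: `remove_punctuation`, `CURRENT_PUNCT`, and `EXTENDED_PUNCT` are hypothetical names, and the sketch assumes the characters are simply deleted.

```python
# Characters the current method removes, per the comment above.
CURRENT_PUNCT = ".-:,?"

# Proposed extension: also remove ! " and '
EXTENDED_PUNCT = CURRENT_PUNCT + "!\"'"


def remove_punctuation(text, chars=CURRENT_PUNCT):
    """Delete every character in `chars` from `text`."""
    # str.maketrans with a third argument maps those characters to None.
    return text.translate(str.maketrans("", "", chars))


sample = 'Hola, "mundo"!'
print(remove_punctuation(sample))                  # current set
print(remove_punctuation(sample, EXTENDED_PUNCT))  # extended set
```

One thing to watch with plain deletion is that hyphenated words get joined (for example `sin-luz` becomes `sinluz`), so mapping `-` to a space might be preferable before BPE.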