Closed: NirantK closed this 4 years ago
The notebook shows that 532 tweets add vocabulary from 27 additional languages, which we do not want.
Since these are less than 3% of our tagged set, it might be a good idea to drop these 532 tweets to reduce the variance in the character set?
Alright, I'll check this out. I think we should be able to drop them without much of an impact.
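One way to do such a drop could look like the sketch below. It is not the notebook's method: it assumes Hinglish text uses only Latin or Devanagari letters and flags any tweet containing a letter from another script, using the first word of each character's Unicode name as a rough script label. The names `ALLOWED_SCRIPTS`, `char_script`, `has_foreign_script` and `split_by_script` are hypothetical. A script check like this only narrows the character set; it cannot catch other languages written in Latin script.

```python
import unicodedata

# Assumption: Hinglish text is written in Latin or Devanagari script.
ALLOWED_SCRIPTS = {"LATIN", "DEVANAGARI"}


def char_script(ch):
    """Return the first word of a letter's Unicode name, or None for non-letters."""
    if not ch.isalpha():
        return None  # digits, punctuation, emoji, combining marks and spaces are ignored
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def has_foreign_script(text, allowed=ALLOWED_SCRIPTS):
    """True if any letter in `text` belongs to a script outside `allowed`."""
    return any(
        script is not None and script not in allowed
        for script in map(char_script, text)
    )


def split_by_script(tweets):
    """Partition tweets into (kept, dropped) lists, preserving order."""
    kept, dropped = [], []
    for tweet in tweets:
        (dropped if has_foreign_script(tweet) else kept).append(tweet)
    return kept, dropped
```

Calling `split_by_script(tweets)` on a list of tweet strings returns the tweets to keep and the ones to drop, so the dropped set can be inspected and counted before anything is removed from the tagged data.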
Resolved by #19
They have included Filipino tweets in a dataset of Hinglish tweets 🤦‍♂️
I wonder how they assigned a sentiment to them. Do they speak Filipino too?
The truncated tweets also indicate that these were longer, 280-character tweets, which they truncated to 140 characters.
Example:
which translates to: