NirantK / Hinglish

Hinglish Text Classification
MIT License
30 stars, 10 forks

Remove Filipino and non-Hinglish data #8

Closed NirantK closed 4 years ago

NirantK commented 4 years ago

They have included Filipino tweets in a dataset of Hinglish tweets 🤦‍♂

I wonder how they assigned it a sentiment. Do they speak Filipino too?

The truncated tweets also suggest that these were originally longer, 280-character tweets that were cut down to 140 characters.

Example:

happy birthday seatmate thank you kasi masginanahan ako magaral kasi katabi ko kayo ni dom hahahhah thank you are https t co jyeeskiyf

which translates to:

happy birthday seatmate thank you for helping me study because i was with you hahahhah thank you are https t co jyeeskiyf

NirantK commented 4 years ago

The notebook shows that 532 tweets add vocabulary from 27 additional languages, which we do not want.

Since these make up less than 3% of our tagged set, it might be a good idea to drop these 532 tweets to reduce the variance in the character set.
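
For illustration only, here is a minimal sketch of one way such a filter could look. It is not the notebook's actual method. It assumes Hinglish tweets contain only Latin or Devanagari letters and drops anything with letters from another script. The names `ALLOWED_SCRIPTS`, `has_unexpected_script`, and `drop_unexpected_scripts` are hypothetical.

```python
import unicodedata

# Assumption: valid Hinglish tweets use only Latin or Devanagari letters.
ALLOWED_SCRIPTS = ("LATIN", "DEVANAGARI")


def has_unexpected_script(text: str) -> bool:
    """Return True if any alphabetic character is outside the allowed scripts."""
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace, emoji, combining marks
        # Unicode character names begin with the script name for these scripts,
        # e.g. "LATIN SMALL LETTER A" or "DEVANAGARI LETTER MA".
        if not unicodedata.name(ch, "").startswith(ALLOWED_SCRIPTS):
            return True
    return False


def drop_unexpected_scripts(tweets):
    """Return (kept_tweets, number_dropped)."""
    kept = [t for t in tweets if not has_unexpected_script(t)]
    return kept, len(tweets) - len(kept)
```

Note that a script check like this would not catch the Filipino example above, since it is written in Latin script. Those tweets would need a language-identification step or manual review.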

meghanabhange commented 4 years ago

Alright, I'll check this out. I think we should be able to drop them without much of an impact.

meghanabhange commented 4 years ago

Resolved by #19