It might be useful for us to try a language-modeling-based approach. Since 18K tweets was too small a sample for that, I am adding two larger datasets:
Medium: 167K tweets, high confidence that they are Hinglish
Large: 384K tweets, low confidence that they are Hinglish
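To make the language-modeling idea concrete, here is a minimal sketch of a character-trigram language model that could score how "Hinglish-like" a tweet is. The choice of n=3, the add-alpha smoothing, and the byte-level vocabulary size are my assumptions for illustration, not part of this release.

```python
import math
from collections import Counter

def train_char_ngram_lm(texts, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts = Counter(), Counter()
    for text in texts:
        padded = " " * (n - 1) + text
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def log_prob(text, ngrams, contexts, n=3, alpha=1.0, vocab_size=256):
    """Add-alpha smoothed log-probability of a string under the LM."""
    padded = " " * (n - 1) + text
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        num = ngrams[gram] + alpha
        den = contexts[gram[:-1]] + alpha * vocab_size
        total += math.log(num / den)
    return total
```

A model trained on the high-confidence tweets could then rank low-confidence tweets by log-probability per character, keeping only the most Hinglish-like ones.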
Data Label Notes
Low-confidence data can include Filipino, Indonesian (Bahasa Indonesia), and Pakistani Urdu-English tweets in addition to Hinglish. This is acceptable, as the target train/test data also has such impurities.
The low-confidence dataset is a superset of the high-confidence one: everything in the high-confidence set is already included in the low-confidence set, so the low-only complement is easy to carve out (see the sketch below).
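Since the superset relationship holds, avoiding duplicate tweets across the two files is a simple set-difference. A minimal sketch, assuming one tweet per line; the file names `high_confidence.txt`, `low_confidence.txt`, and `low_only.txt` are hypothetical, not the names shipped with this release:

```python
# Hypothetical file names; adjust to the actual release layout.
with open("high_confidence.txt", encoding="utf-8") as f:
    high = set(line.strip() for line in f)

# Keep only tweets in the low-confidence set that are NOT already
# in the high-confidence subset (the "low-only" complement).
with open("low_confidence.txt", encoding="utf-8") as f_in, \
     open("low_only.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        if line.strip() not in high:
            f_out.write(line)
```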
Data Statistics
Large:
Unique Tokens: 731.9K
Total Tokens: 4.5M
[Charts: most common tokens and vocabulary distribution for the Large dataset]
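The statistics above (and the omitted most-common-tokens and vocabulary-distribution charts) can be reproduced with a simple frequency count. A sketch, assuming one tweet per line in a hypothetical `large.txt`; note the naive lowercased whitespace tokenization is my assumption, and a real pipeline might use a proper tweet tokenizer:

```python
from collections import Counter

counts = Counter()
with open("large.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.lower().split())  # naive whitespace tokenization

print(f"Unique tokens: {len(counts):,}")
print(f"Total tokens:  {sum(counts.values()):,}")
print("Most common tokens:", counts.most_common(20))

# Vocabulary distribution: how many token types occur exactly k times.
freq_of_freqs = Counter(counts.values())
for k in sorted(freq_of_freqs)[:10]:
    print(f"{freq_of_freqs[k]:>8} token types occur {k}x")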