It might be useful for us to try a language-modeling-based approach. Since 18K tweets was too small a sample for that, I am adding two larger datasets:
Medium: 167K tweets, high confidence that they are Hinglish
Large: 384K tweets, low confidence that they are Hinglish
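To make the language-modeling idea concrete, here is a minimal sketch of a character-trigram language model that could score how "Hinglish-like" a tweet is. The choice of n=3, the add-alpha smoothing, and the byte-level vocabulary size are my assumptions for illustration, not part of this release.

```python
import math
from collections import Counter

def train_char_ngram_lm(texts, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts = Counter(), Counter()
    for text in texts:
        padded = " " * (n - 1) + text
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def log_prob(text, ngrams, contexts, n=3, alpha=1.0, vocab_size=256):
    """Add-alpha smoothed log-probability of a string under the LM."""
    padded = " " * (n - 1) + text
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        num = ngrams[gram] + alpha
        den = contexts[gram[:-1]] + alpha * vocab_size
        total += math.log(num / den)
    return total
```

A model trained on the high-confidence tweets could then rank low-confidence tweets by log-probability per character, keeping only the most Hinglish-like ones.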
Data Label Notes
Low-confidence data can include Filipino, Indonesian (Bahasa Indonesia), and Pakistani Urdu-English tweets in addition to Hinglish. This is acceptable, as the target train/test data also has such impurities.
The low-confidence dataset is a superset of the high-confidence one: everything in the high-confidence set is already included in the low-confidence set, so the low-only complement is easy to carve out (see the sketch below).
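Since the superset relationship holds, avoiding duplicate tweets across the two files is a simple set-difference. A minimal sketch, assuming one tweet per line; the file names `high_confidence.txt`, `low_confidence.txt`, and `low_only.txt` are hypothetical, not the names shipped with this release:

```python
# Hypothetical file names; adjust to the actual release layout.
with open("high_confidence.txt", encoding="utf-8") as f:
    high = set(line.strip() for line in f)

# Keep only tweets in the low-confidence set that are NOT already
# in the high-confidence subset (the "low-only" complement).
with open("low_confidence.txt", encoding="utf-8") as f_in, \
     open("low_only.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        if line.strip() not in high:
            f_out.write(line)
```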
Data Statistics
Large:
Unique Tokens: 731.9K
Total Tokens: 4.5M
[Charts: most common tokens and vocabulary distribution for the Large dataset]
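The statistics above (and the omitted most-common-tokens and vocabulary-distribution charts) can be reproduced with a simple frequency count. A sketch, assuming one tweet per line in a hypothetical `large.txt`; note the naive lowercased whitespace tokenization is my assumption, and a real pipeline might use a proper tweet tokenizer:

```python
from collections import Counter

counts = Counter()
with open("large.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.lower().split())  # naive whitespace tokenization

print(f"Unique tokens: {len(counts):,}")
print(f"Total tokens:  {sum(counts.values()):,}")
print("Most common tokens:", counts.most_common(20))

# Vocabulary distribution: how many token types occur exactly k times.
freq_of_freqs = Counter(counts.values())
for k in sorted(freq_of_freqs)[:10]:
    print(f"{freq_of_freqs[k]:>8} token types occur {k}x")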