NirantK / Hinglish

Hinglish Text Classification
MIT License

Adding Data for LM Pre-training #11

Closed NirantK closed 4 years ago

NirantK commented 4 years ago

It might be useful for us to try a language-modeling-based approach.

Since 18K was too small a sample for that, I am adding more data for LM pre-training:

Data Label Notes

Data Statistics

Large:

Unique Tokens: 731.9K

Total Tokens: 4.5M

Most Common Tokens

('RT', 97411),
 ('hai', 70560),
 ('ki', 44641),
 ('ko', 36458),
 ('ke', 33234),
 ('bhi', 30813),
 ('to', 30716),
 ('ka', 30499),
 ('se', 29308),
 ('ho', 24791),
 ('hi', 21437),
 ('nahi', 21230),
 ('k', 20712),
 ('me', 20274),
 ('aur', 16961),
 ('na', 14382),
 ('kar', 13167),
 ('ye', 13032),
 ('kya', 12469),
 ('h', 11275),
 ('ne', 11058),
 ('koi', 10597),
 ('nhi', 10150),
 ('mein', 9801),
 ('hain', 9785),
 ('ya', 9759),
 ('toh', 9613),
 ('is', 9276),
 ('kuch', 8591),
 ('tu', 8414),
 ('ek', 8391),
 ('tha', 8173),
 ('jo', 8090),
 ('liye', 7933),
 ('or', 7926),
 ('Ye', 7647),
 ('ji', 7608),
 ('he', 7589),
 ('raha', 7504),
 ('hai.', 7404),
 ('main', 7381),
 ('bhai', 7331),
 ('baat', 7314),
 ('😂', 7137),
 ('ha', 7017),
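
Statistics like the ones above can be reproduced with a short script. The following is only a minimal sketch, assuming plain whitespace tokenization with no lowercasing or punctuation stripping; that assumption is suggested by `'hai'` and `'hai.'`, and `'ye'` and `'Ye'`, appearing as separate entries, but the notebook's actual tokenizer is not shown here. The function name `token_stats` and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def token_stats(
    lines: Iterable[str], top_n: int = 45
) -> Tuple[int, int, List[Tuple[str, int]]]:
    """Return (unique_tokens, total_tokens, most_common) for whitespace-split text."""
    counts: Counter = Counter()
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace,
        # so punctuation stays attached: 'hai.' is counted apart from 'hai'.
        counts.update(line.split())
    total = sum(counts.values())
    return len(counts), total, counts.most_common(top_n)


# Hypothetical sample lines, not taken from the actual dataset.
sample = ["RT ye baat sahi hai", "Ye kya hai.", "ye bhi hai"]
unique, total, common = token_stats(sample, top_n=3)
print(f"Unique Tokens: {unique}")
print(f"Total Tokens: {total}")
print(common)
```

Lowercasing and stripping trailing punctuation before counting would merge pairs such as `'ye'`/`'Ye'` and `'hai'`/`'hai.'`, which would shrink the unique-token count; whether that is desirable for LM pre-training is a separate design choice.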

Vocabulary Distribution

review-notebook-app[bot] commented 4 years ago

Check out this pull request on ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.