Annotation process on Hinglish dataset

NirantK / Hinglish

Hinglish Text Classification

MIT License


Annotation process on Hinglish dataset #53

Closed sharmila-polamuri closed 3 years ago

sharmila-polamuri commented 3 years ago

Hello guys, I am trying to understand how to annotate Hinglish data so that it is useful for fine-tuning or building a language model. In your train.json file I saw that text is the tweet and clean_text is the preprocessed text. But what about the Hindi words in the tweet? Are Hindi words tagged? I ask because I read one article in which Hinglish tweets were annotated with the corresponding English sentences. In another article, each word in a Hinglish sentence was tagged with its language, and the whole sentence was tagged with a sentiment label or a label for some other classification problem.

I just wanted to know what process you used to annotate the training data. Can you please tell me?

NirantK commented 3 years ago

Hello @sharmila-polamuri - thanks for your interest in our work!

Are Hindi words tagged? No, they are not.

We have two kinds of datasets:

For making the language model itself: This is just plain Hinglish text and doesn't need any supervised labels. Since the task used is typically next word prediction, this is a self-supervised learning task.
For finetuning the LM for a specific task e.g. text classification: This is where we have labels e.g. positive, negative or neutral in case of sentiment analysis. The whole sentence is tagged with the classification label.
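To make the difference between the two kinds of data concrete, here is a minimal sketch. The sentences are invented for illustration, and the `label` field name is an assumption (only `text` and `clean_text` are mentioned in this thread); it is not the project's actual data loading code.

```python
# Hypothetical Hinglish snippets, invented for illustration only.
unlabeled_corpus = [
    "kal movie dekhne chalte hain",
    "I am traveling in gadi",
]


def next_word_pairs(sentence):
    """Turn one plain sentence into (context, next_word) training pairs.

    No human labels are needed: the "label" is simply the next word,
    which is why language modelling is called self-supervised.
    """
    tokens = sentence.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Fine-tuning data: one label for the whole sentence, no per-word tags.
# The "label" key is an assumed name, not confirmed by the repository.
labeled_examples = [
    {"clean_text": "yeh movie bahut acchi thi", "label": "positive"},
    {"clean_text": "service bilkul bekaar hai", "label": "negative"},
]

pairs = next_word_pairs(unlabeled_corpus[1])
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Note that neither dataset carries any marker saying which words are Hindi and which are English.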

Does this answer your question? If yes, please close the issue :)

sharmila-polamuri commented 3 years ago

Thank you for your response @NirantK

But I have a doubt: without labeling the Hinglish sentences, how can the language model differentiate between English and Hindi words (or, more generally, between any local language and English)? Take a sentence like "I am traveling in gadi", where "gadi" is a Hindi word in an otherwise English sentence. Can you please clarify this?

NirantK commented 3 years ago

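The advantage of LMs trained with sub-word methods like Byte Pair Encoding is that the word "gadi" does not have to be added to the vocabulary explicitly. An unseen word is split into smaller sub-word pieces that are already in the vocabulary.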

So we don't really need to recognize that the word is in Hindi, or differentiate on the basis of that. The language identification task is something that doesn't need to be solved.
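As a rough illustration of the sub-word idea, here is a toy Byte Pair Encoding sketch in plain Python. It is a simplified version written for this thread (no end-of-word marker, a tiny invented word list), not the tokenizer actually used in the repository.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into sub-word pieces by replaying the learned merges."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Invented mixed word list; "gadi" itself never appears in it.
corpus_words = ["gana", "gaana", "garam", "aadi", "adi", "traveling", "in", "am"]
merges = learn_bpe(corpus_words, num_merges=10)
pieces = encode("gadi", merges)
print(pieces)
```

Even though "gadi" was never seen during training, it is still encoded as a sequence of known pieces, and nothing in the procedure asks which language a word belongs to.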

NirantK commented 3 years ago

Closing this issue for now. Please raise a new issue if you have further questions, @sharmila-polamuri!

Thanks for your interest in our work! Happy to answer and assist in any way we can :)