AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For latest Indic-BERT v2, check: https://github.com/AI4Bharat/IndicBERT
https://indicnlp.ai4bharat.org
MIT License

What is special in IndicBERT compared to other models? #31

Closed chayan-dhaddha closed 3 years ago

gowtham1997 commented 3 years ago

IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major languages of India: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

It is pretrained on our novel monolingual corpus, IndicCorp, which contains around 9 billion tokens, and subsequently evaluated on a diverse set of tasks. IndicBERT has far fewer parameters than other multilingual models such as mBERT and XLM-R (refer to the attached picture) while achieving performance on par with or better than these models.

[Image: comparison of parameter counts for IndicBERT, mBERT, XLM-R, and other multilingual models]
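The parameter savings come largely from ALBERT's design, chiefly factorized embedding parameterization (and cross-layer parameter sharing). Below is a minimal sketch of the embedding factorization; the vocabulary size and dimensions are illustrative assumptions, not IndicBERT's exact configuration:

```python
# Why an ALBERT-style model (like IndicBERT) needs far fewer embedding
# parameters than a BERT-style model with the same vocabulary.
# Sizes below are assumed for illustration only.

def bert_embedding_params(vocab_size, hidden_size):
    # BERT ties embedding width to hidden size: one V x H matrix.
    return vocab_size * hidden_size

def albert_embedding_params(vocab_size, embed_size, hidden_size):
    # ALBERT factorizes the embedding into V x E and E x H, with E << H.
    return vocab_size * embed_size + embed_size * hidden_size

V, E, H = 200_000, 128, 768  # assumed vocab / embedding / hidden sizes
bert_like = bert_embedding_params(V, H)
albert_like = albert_embedding_params(V, E, H)
print(f"BERT-style embeddings:   {bert_like:,}")     # 153,600,000
print(f"ALBERT-style embeddings: {albert_like:,}")   # 25,698,304
print(f"reduction factor:        {bert_like / albert_like:.1f}x")
```

With a large multilingual vocabulary, the factorization alone cuts embedding parameters by roughly 6x in this sketch; cross-layer sharing shrinks the transformer stack further.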

You can get more details about the tasks and the IndicBERT model from our paper.

As this is not a GitHub issue regarding the code, I'm closing it.