AI4Bharat / indicnlp_corpus

Description: Describes the IndicNLP corpus and associated datasets

Is the Hindi corpus suitable for training a Hindi-BERT model? #7

Closed skmalviya closed 4 years ago

skmalviya commented 4 years ago

I have gone through training the Google BERT model from scratch. I found that BERT requires consecutive, related sentences, e.g.:

sentence 1
sentence 2

It uses next-sentence prediction as a core training objective. But in the AI4Bharat-IndicNLP Hindi corpus, consecutive sentences are mostly unrelated.
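To make the concern concrete, here is a minimal sketch of how next-sentence-prediction training pairs are typically built from ordered documents. `make_nsp_pairs` and the toy documents are hypothetical names for illustration, not code from the Google BERT repository, and the 50/50 split between true and random next sentences follows the scheme described for BERT rather than any particular implementation.

```python
import random


def make_nsp_pairs(documents, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples.

    `documents` is a list of documents, each a list of sentences kept
    in their original order. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    pairs = []
    for doc_idx, doc in enumerate(documents):
        # Candidate sources for a "not next" sentence: any other non-empty document.
        other_docs = [d for j, d in enumerate(documents) if j != doc_idx and d]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5 or not other_docs:
                # Positive example: the sentence that really follows in the document.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a random sentence from a different document.
                random_doc = rng.choice(other_docs)
                pairs.append((doc[i], rng.choice(random_doc), False))
    return pairs


# Toy documents; in a real corpus each inner list would be one article.
docs = [
    ["A1", "A2", "A3"],
    ["B1", "B2"],
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(docs):
    print(sentence_a, "|", sentence_b, "|", is_next)
```

The positive label is only meaningful when `doc[i + 1]` really follows `doc[i]` in the source text. If the corpus stores sentences in shuffled order, the "is next" pairs are as unrelated as the random ones, so the objective has nothing to learn from.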

If the corpus is not meant for BERT training, what would you suggest as a starting point for training a good Hindi BERT model?

anoopkunchukuttan commented 4 years ago

We have recently released a BERT model for Indian languages: https://indicnlp.ai4bharat.org/