dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Multilingual BERT requested #1039

Open timespaceuniverse opened 4 years ago

timespaceuniverse commented 4 years ago

Description

A BERT model trained on a multilingual corpus was released by Google. Essentially, corpora for the most widely used languages are prepared, and the multilingual word tokens are processed with BPE (byte pair encoding). The following features are requested:

1. Code to prepare a multilingual corpus.
2. Code to process multilingual word tokens with BPE.
3. Pre-training code for BERT on a multilingual corpus.
4. Fine-tuning code on a multilingual corpus.
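For reference, GluonNLP 0.x already ships a multilingual BERT checkpoint together with its WordPiece vocabulary, so loading and tokenizing can look roughly like the sketch below. The `wiki_multilingual_uncased` dataset name is an assumption on my part; please verify the exact identifier against your GluonNLP version.

```python
# Minimal sketch, assuming GluonNLP 0.x with MXNet and the
# 'wiki_multilingual_uncased' pretrained weights (checkpoint name assumed).
import mxnet as mx
import gluonnlp as nlp

model, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='wiki_multilingual_uncased',  # assumed checkpoint name
    pretrained=True,
    ctx=mx.cpu(),
    use_pooler=True,
    use_decoder=False,
    use_classifier=False)

# WordPiece tokenization with the multilingual vocabulary.
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
tokens = tokenizer('gluon-nlp unterstützt mehrsprachige Modelle')
token_ids = vocab[tokens]
print(tokens)
print(token_ids)
```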

References

Google Research used 102 languages to train multilingual BERT:

''' Data Source and Sampling

The languages chosen were the top 100 languages with the largest Wikipedias. The entire Wikipedia dump for each language (excluding user and talk pages) was taken as the training data for each language.

However, the size of the Wikipedia for a given language varies greatly, and therefore low-resource languages may be "under-represented" in terms of the neural network model (under the assumption that languages are "competing" for limited model capacity to some extent). At the same time, we also don't want to overfit the model by performing thousands of epochs over a tiny Wikipedia for a particular language.

To balance these two factors, we performed exponentially smoothed weighting of the data during pre-training data creation (and WordPiece vocab creation). In other words, let's say that the probability of a language is P(L), e.g., P(English) = 0.21 means that after concatenating all of the Wikipedias together, 21% of our data is English. We exponentiate each probability by some factor S and then re-normalize, and sample from that distribution. In our case we use S=0.7. So, high-resource languages like English will be under-sampled, and low-resource languages like Icelandic will be over-sampled. E.g., in the original distribution English would be sampled 1000x more than Icelandic, but after smoothing it's only sampled 100x more. '''
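To make the smoothing step concrete, here is a small sketch (plain Python/NumPy, not taken from the BERT repo) of the exponentiate-and-renormalize computation described in the quote:

```python
import numpy as np

def smoothed_sampling_probs(corpus_sizes, s=0.7):
    """Exponentially smooth per-language sampling probabilities.

    corpus_sizes: dict mapping language -> raw corpus size.
    Returns a dict of sampling probabilities in which high-resource
    languages are under-sampled and low-resource ones over-sampled.
    """
    langs = list(corpus_sizes)
    counts = np.array([corpus_sizes[l] for l in langs], dtype=np.float64)
    p = counts / counts.sum()   # original distribution P(L)
    q = p ** s                  # exponentiate each probability by S
    q = q / q.sum()             # re-normalize
    return dict(zip(langs, q))

# Toy example: English is 1000x larger than Icelandic in the raw data,
# but only 1000**0.7 ~= 126x more likely to be sampled after smoothing
# (roughly the "100x" figure quoted above).
print(smoothed_sampling_probs({'en': 1_000_000, 'is': 1_000}))
```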

https://github.com/google-research/bert/blob/master/multilingual.md
https://github.com/google-research/bert

eric-haibin-lin commented 4 years ago

@kaonashi-tyc is there anything you can contribute back to GluonNLP from your multilingual BERT work?

kaonashi-tyc commented 4 years ago

@eric-haibin-lin I would like to discuss the opportunity. Currently we are focusing our effort on XLM-R-style multilingual model pretraining.

Vedant-06 commented 4 years ago

@kaonashi-tyc I would like to contribute multilingual BERT for Indic languages.