How to train bert-base-italian-* models?

nikhilno1 commented 4 years ago

Thanks for sharing. I want to train a different language model (Hindi). How did you train your bert-base-italian-* models? Are those steps covered anywhere?

stefan-it commented 4 years ago

Hi @nikhilno1,

To train the Italian models, we took the following steps:

Collect corpora (we mainly used OPUS and OSCAR)
Generate a vocabulary using SentencePiece (there is an example command in the sciBERT repo) and convert it to a BERT-compatible one
Sentence splitting (we use NLTK, because it is much faster than spaCy)
Sharding (1GB text per shard)
TFRecord generation for a sequence length of 512 (more information on that topic can be found in the official BERT repo)
Train BERT model on TPU v3-8
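
The sharding step above can be sketched roughly as follows. This is only an illustration, not the script we actually used: the function name `shard_lines`, the file-name pattern, and the `max_bytes` parameter are all made up for this example. It assumes the input is already sentence-split (one sentence per line, with blank lines between documents, as the BERT pretraining scripts expect) and starts a new shard whenever the current one would grow past the size limit.

```python
import os


def shard_lines(lines, out_dir, max_bytes=1_000_000_000, prefix="shard"):
    """Write lines to numbered shard files of at most ~max_bytes each.

    Hypothetical helper for illustration. `lines` is any iterable of
    strings (one sentence per line, "" for a document boundary).
    A single line longer than max_bytes still gets its own shard.
    Returns the list of shard paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0  # bytes written to the current shard
    for line in lines:
        data = line.rstrip("\n") + "\n"
        size = len(data.encode("utf-8"))
        # Open a new shard at the start, or when this line would not fit.
        if out is None or (written > 0 and written + size > max_bytes):
            if out is not None:
                out.close()
            path = os.path.join(out_dir, f"{prefix}-{len(paths):05d}.txt")
            out = open(path, "w", encoding="utf-8", newline="\n")
            paths.append(path)
            written = 0
        out.write(data)
        written += size
    if out is not None:
        out.close()
    return paths
```

Passing an open file object as `lines` streams the corpus without loading it into memory. Note that this simple version may cut a document in two at a shard boundary; whether that matters depends on the corpus.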

I do plan to write a cheatsheet for an upcoming BERT model, where I use the awesome new Hugging Face tokenizers library for creating the BERT vocab!

stefan-it commented 4 years ago

Hi @nikhilno1,

For the Turkish BERT model, I created a cheatsheet for the training process:

https://github.com/stefan-it/turkish-bert/blob/master/CHEATSHEET.md

It also shows how to generate a BERT-compatible vocab.
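
To give an idea of what "BERT-compatible" means when starting from a SentencePiece vocab (as in the Italian steps), here is a minimal sketch. It is not taken from the cheatsheet or from the sciBERT repo; the function name and the exact set of special tokens are assumptions for illustration. It relies on two facts: SentencePiece `.vocab` files hold one `piece<TAB>score` entry per line with word-initial pieces prefixed by "▁", while BERT's WordPiece vocab marks word-internal pieces with a `##` prefix instead.

```python
# Special tokens BERT expects; real vocabs often also reserve [unusedN] slots.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
SP_WORD_START = "\u2581"  # the "▁" marker SentencePiece puts on word-initial pieces
SP_CONTROL = {"<unk>", "<s>", "</s>", "<pad>"}  # SentencePiece's own control symbols


def sentencepiece_to_bert_vocab(sp_vocab_lines):
    """Convert SentencePiece vocab lines into a WordPiece-style token list.

    Hypothetical helper for illustration only.
    """
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for line in sp_vocab_lines:
        piece = line.rstrip("\n").split("\t")[0]
        if not piece or piece in SP_CONTROL:
            continue
        if piece.startswith(SP_WORD_START):
            # Word-initial piece: drop the marker, no prefix in WordPiece.
            token = piece[len(SP_WORD_START):]
        else:
            # Word-internal piece: WordPiece marks these with "##".
            token = "##" + piece
        # Skip the bare "▁" piece (empty after stripping) and duplicates.
        if not token or token in seen:
            continue
        seen.add(token)
        vocab.append(token)
    return vocab
```

Writing the returned tokens one per line gives a `vocab.txt` in the layout BERT tokenizers read. Details such as casing and the number of reserved tokens still have to match the model configuration.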

I hope this helps + good luck with the Hindi model!

nikhilno1 commented 4 years ago

Thanks for sharing. Will go through it.
