dbmdz / berts

DBMDZ BERT, DistilBERT, ELECTRA, GPT-2 and ConvBERT models

Bert-ita-xxl #42

Open · IreneSucameli opened this issue 2 years ago

IreneSucameli commented 2 years ago

Hi, could you please specify in what proportions Wikipedia, the OPUS and the OSCAR corpora were used for training ita-bert-xxl? Thanks

stefan-it commented 2 years ago

Hi @IreneSucameli ,

I just looked at the dataset sizes:

The Wikipedia dump was 2.7GB, OPUS was 10.3GB (the 13GB corpus of the "normal" model minus the 2.7GB of Wikipedia), and OSCAR was 68GB.

We did not apply any upsampling/downsampling strategy (as is used, e.g., in some GPT-2 based papers).
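
As a quick sanity check, here is a minimal sketch (plain Python, using only the sizes quoted above) of how those sizes translate into corpus shares, assuming each corpus contributes proportionally to its size:

```python
# Rough corpus shares for bert-ita-xxl, assuming the sizes quoted above
# and no up-/down-sampling (each corpus contributes proportionally to its size).
sizes_gb = {
    "Wikipedia": 2.7,
    "OPUS": 10.3,   # 13GB "normal" corpus minus the 2.7GB Wikipedia dump
    "OSCAR": 68.0,
}

total = sum(sizes_gb.values())  # 81GB in total
for name, size in sizes_gb.items():
    print(f"{name}: {size:.1f}GB ({size / total:.1%})")

# Wikipedia: 2.7GB (3.3%)
# OPUS: 10.3GB (12.7%)
# OSCAR: 68.0GB (84.0%)
```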

IreneSucameli commented 2 years ago

Hi @stefan-it ,

thank you for the information! And what about the vocabulary: could you kindly tell me how many GB it is? Thanks

stefan-it commented 2 years ago

The "normal" and XXL model use the same 31.102 wordpiece-based vocab.

The vocab was trained on the pre-training corpus of the "normal" model, which has a size of 13GB (OPUS + Wikipedia). Please note that SentencePiece was used to train an SPM model; we then converted the SPM vocab into a wordpiece-based vocab. This was necessary because in 2019 no library such as Hugging Face Tokenizers existed yet. The SPM vocab size was 31,000, and 100 "unused" tokens were added (as was done in the original BERT vocab).
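
Roughly, such a conversion can be sketched as follows (illustrative only, with hypothetical file names; this is not the exact script used for the released models):

```python
# Simplified sketch of an SPM -> WordPiece vocab conversion (illustrative only,
# hypothetical file names; not the exact script used for the released models).
import sentencepiece as spm

# Train a 31,000-piece SPM model on the 13GB corpus (hypothetical path).
spm.SentencePieceTrainer.train(
    input="corpus_it.txt", model_prefix="it_spm", vocab_size=31000
)

sp = spm.SentencePieceProcessor(model_file="it_spm.model")

wordpiece_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
wordpiece_vocab += [f"[unused{i}]" for i in range(100)]

for piece_id in range(sp.get_piece_size()):
    piece = sp.id_to_piece(piece_id)
    if piece in ("<unk>", "<s>", "</s>"):
        continue  # drop the three SPM control symbols
    if piece.startswith("\u2581"):            # "▁" marks a word-initial piece
        wordpiece_vocab.append(piece[1:] or piece)
    else:
        wordpiece_vocab.append("##" + piece)  # continuation piece

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(wordpiece_vocab))

print(len(wordpiece_vocab))  # 5 + 100 + (31,000 - 3) = 31,102
```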

The SPM vocab had three special symbols: ['<unk>', '<s>', '</s>'], so the effective vocab size is 31,000 - 3 = 30,997. Adding the 100 unused tokens gives 30,997 + 100 = 31,097. Then we need to add the following special tokens for BERT: ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'], which brings the total/final vocab size to 31,097 + 5 = 31,102.
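
A quick way to double-check the final number is to load the released tokenizer (assuming transformers is installed; the model id below refers to the cased XXL variant):

```python
# Quick check of the final vocab size from the Hub (model id assumed to be
# the cased XXL Italian BERT; adjust if you use the uncased variant).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
print(tokenizer.vocab_size)  # expected: 31102
```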

IreneSucameli commented 2 years ago

Ok, I see. Thank you very much, you have been very helpful!