IreneSucameli opened this issue 2 years ago
Hi, could you please specify in what percentages the Wikipedia, OPUS and OSCAR corpora were used for training ita-bert-xxl? Thanks
Hi @IreneSucameli,
I just looked at the dataset sizes:
The Wikipedia dump was 2.7GB, OPUS was 10.3GB (the 13GB corpus minus the 2.7GB Wikipedia part) and OSCAR was 68GB.
We did not apply any upsampling/downsampling strategy (as is used e.g. in some GPT-2-based papers).
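If it helps, here is a quick back-of-the-envelope calculation of the relative shares, using the raw on-disk sizes above as a rough proxy (actual token counts will differ somewhat):

```python
# Rough corpus shares for the XXL pre-training data, computed from the
# on-disk sizes mentioned above (GB, not token counts).
sizes_gb = {
    "Wikipedia": 2.7,
    "OPUS": 10.3,   # 13GB ("normal" corpus) minus the 2.7GB Wikipedia part
    "OSCAR": 68.0,
}

total = sum(sizes_gb.values())  # 81GB in total for the XXL corpus
for name, size in sizes_gb.items():
    print(f"{name}: {size} GB ({size / total:.1%})")

# Since no up-/downsampling was applied, these shares roughly correspond to
# the proportions seen during pre-training:
#   Wikipedia: ~3.3%, OPUS: ~12.7%, OSCAR: ~84.0%
```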
Hi @stefan-it,
thank you for the information provided! And what about the vocabulary size? Could you kindly tell me how many GB the vocabulary is? Thanks
The "normal" and XXL model use the same 31.102 wordpiece-based vocab.
The vocab was trained on the pre-training corpus of the "normal" model, which has a size of 13GB (OPUS + Wikipedia). Please note that SentencePiece was used to train an SPM model, and we then converted the SPM vocab into a wordpiece-based vocab. This was necessary because in 2019 no library such as Hugging Face Tokenizers existed. The SPM vocab size was 31,000. Then 100 "unused" tokens were added (as was done in the original BERT vocab).
The SPM vocab had three special symbols: ['<unk>', '<s>', '</s>'], so the effective vocab size would be 31,000 - 3 = 30,997. Adding the 100 unused tokens gives 30,997 + 100 = 31,097. Then we need to add the following special tokens for BERT: ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'], so the total/final vocab size is 31,097 + 5 = 31,102.
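For anyone who wants to build a similar vocab, here is a rough sketch of the pipeline described above: train a SentencePiece model, map the SPM pieces to wordpiece conventions, drop the SPM special symbols and add the BERT special tokens plus the 100 unused slots. The file names and the ▁/## mapping heuristic are illustrative assumptions, not the exact script that was used back in 2019:

```python
import sentencepiece as spm

# 1) Train a SentencePiece model on the pre-training corpus (one sentence per
#    line). "corpus.txt" and "italian_spm" are placeholder names; a corpus of
#    ~13GB typically also needs options like input_sentence_size.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="italian_spm",
    vocab_size=31_000,
)

# 2) Map the SPM pieces to wordpiece conventions: SPM marks word-initial
#    pieces with a leading "▁", wordpiece marks word-internal pieces with "##".
sp = spm.SentencePieceProcessor(model_file="italian_spm.model")

wordpieces = []
for i in range(sp.get_piece_size()):
    piece = sp.id_to_piece(i)
    if piece in ("<unk>", "<s>", "</s>"):
        continue  # drop the 3 SPM special symbols
    if piece.startswith("▁"):
        wordpieces.append(piece[1:] if len(piece) > 1 else piece)  # word-initial
    else:
        wordpieces.append("##" + piece)  # word-internal
# len(wordpieces) == 31,000 - 3 == 30,997

# 3) Prepend the BERT special tokens and 100 "[unused]" slots, as in the
#    original BERT vocab.
special = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
unused = [f"[unused{i}]" for i in range(100)]
vocab = special + unused + wordpieces

assert len(vocab) == 31_102  # 30,997 + 100 + 5

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))
```

The resulting vocab.txt can then be loaded with a standard BERT wordpiece tokenizer.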
Ok, I see. Thank you very much, you have been very helpful!