ficstamas / charmen-electra


Pretraining hyperparameters #1

Open stefan-it opened 1 year ago

stefan-it commented 1 year ago

Hi @ficstamas ,

many thanks for open-sourcing this very interesting implementation!

I would like to train my own models with this implementation (as additional models to my ByT5 project on historic texts), so I was wondering if you could share the hyperparameters that were used for pretraining this Hungarian model :thinking:

I would also be interested in the number of GPUs used for pretraining and the total pretraining time for this model.

Many thanks in advance!

ficstamas commented 1 year ago

Hey,

Here's a rough list; let's hope I don't forget anything important:

We have a publication about it, but sadly it is in Hungarian. If you need to know anything else, feel free to ask.

ficstamas commented 1 year ago

We probably trained them on 2 x NVIDIA RTX 2080 Ti; I'm not entirely sure, but I'm going to check.

Yep, we used 2 x NVIDIA RTX 2080 Ti at that time.