JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Training Step Count #11

Closed ekurtulus closed 1 year ago

ekurtulus commented 1 year ago

I am asking this for benchmarking purposes. In the config files, it is stated that training lasts 600_000 micro-batch steps and is terminated after 1 day if that count is not reached. How many training steps are actually taken on an RTX A4000 in a day?

JonasGeiping commented 1 year ago

This depends, of course, on the variant that is running. For c5-o3, a run on our setup with the A4000 takes on average ~245,000 (micro-batch) steps.

Because all systems are built a bit differently, I don't think this number is very meaningful on its own. For benchmarking, my suggestion would always be to rerun the baselines on your system and target budget, and to compare any changes you make against that baseline run on the same system.
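For reference, the "whichever budget runs out first" behavior described above can be pictured with a minimal training-loop sketch. This is not the repository's actual code; the names, numbers, and loop structure are illustrative assumptions only.

```python
import time

# Illustrative sketch (not cramming's implementation): stop at either the
# micro-batch step cap or the wall-clock limit, whichever comes first.
MAX_STEPS = 600_000          # micro-batch step budget from the config
BUDGET_SECONDS = 24 * 3600   # one day of wall-clock time

def train(model, data_iterator, optimizer):
    start = time.time()
    step = 0
    while step < MAX_STEPS and (time.time() - start) < BUDGET_SECONDS:
        batch = next(data_iterator)        # hypothetical iterator of tokenized batches
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
    return step  # steps actually completed within the budget; on an A4000 this lands well below MAX_STEPS
```

Measuring the returned step count on your own hardware is one way to establish the baseline throughput Jonas recommends comparing against.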

JonasGeiping commented 1 year ago

Let me know if other questions come up!

ekurtulus commented 1 year ago

Thank you very much for your answer! I have another question: the tokenizer used for the trained BERT models is bert-x-cased, where x is either base or large, right?

JonasGeiping commented 1 year ago

Hi, do you mean the baseline, pretrained BERT models?

The baseline comparison (e.g. in Table 3, row 1) is to bert-base-uncased. If I remember correctly, this was a tiny bit better than bert-base-cased in this evaluation.
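If it helps, the baseline tokenizer referenced here can be loaded directly from the Hugging Face hub with the standard transformers API. This only shows how to obtain the bert-base-uncased tokenizer for comparison; it makes no claim about which tokenizer the crammed models themselves use.

```python
from transformers import AutoTokenizer

# Load the tokenizer of the baseline model (bert-base-uncased) for comparison.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example: tokenize a sentence and inspect the resulting input IDs.
print(tokenizer("Cramming a BERT into one day of compute.")["input_ids"])
```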

JonasGeiping commented 1 year ago

Closing this for now.