Hello @fqez,
Regarding the hardware, we used just one single TPU v3-8 (not a pod). It is hard to say how much wall-clock time the training took since the hardware is preemptible, but as stated in the paper we trained for 1M steps and later another 1M steps (thanks to TFRC support), so 2M steps in total.
About the second question, I don't have much experience on that matter, but I'm pretty sure you can find something at https://github.com/huggingface/transformers/issues.
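As a rough first check, you could load the checkpoint with the `transformers` library and measure its memory footprint and CPU inference yourself. The snippet below is only a sketch; `your-org/your-model` is a placeholder, not this repository's actual checkpoint name.

```python
# Minimal sketch: load a checkpoint with transformers and estimate its
# memory footprint. "your-org/your-model" is a placeholder model id.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "your-org/your-model"  # placeholder, replace with the real id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Rough fp32 memory estimate for inference (parameters only, no activations).
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M, ~{n_params * 4 / 1e9:.1f} GB in fp32")

# Quick check that masked-language-model inference works on CPU.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Paris is the {tokenizer.mask_token} of France.")[0])
```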
Thanks for your interest and sorry for the late response. Regards!
Hello!
I have read the paper and I would like to replicate the training of this model from scratch. The thing is that you specify the pre-training configuration (hyper-parameters), but not the hardware cost or the time needed to train the models you have released. Can you provide that information? I'm interested in knowing the number of TPU v3-8 pods (the preemptible ones, as mentioned in the article) used for training and the training time (hours, days, weeks), if possible. I would like to estimate how much it would cost to train a model like yours. Also, what minimum hardware would be required to run such a model in a production environment?
Thank you in advance! Regards :)