Closed · achen46 closed this issue 2 years ago
Hi, and thank you for your interest. We tried both single-node and multi-node settings and did not notice any significant difference between the two. If you want to train on multiple nodes, you'd have to divide the per-GPU batch size so the global batch size stays the same. All of the models we've released were trained with a global batch size of 1024, which is 128 samples per GPU, hence the 128 in the config files. If you increase the number of GPUs, you'd have to decrease the per-GPU batch size accordingly (e.g. 2 nodes with 8 GPUs each, 16 GPUs total -> per-GPU batch size 64).
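As a rough sketch of the arithmetic (the names below are illustrative, not taken from the repo's configs):

```python
# Illustrative only; constant and function names are hypothetical.
GLOBAL_BATCH_SIZE = 1024  # total batch size used for all released models

def per_gpu_batch_size(num_nodes: int, gpus_per_node: int) -> int:
    """Per-GPU batch size needed to keep the global batch size at 1024."""
    total_gpus = num_nodes * gpus_per_node
    assert GLOBAL_BATCH_SIZE % total_gpus == 0, "global batch size must divide evenly"
    return GLOBAL_BATCH_SIZE // total_gpus

print(per_gpu_batch_size(1, 8))  # 128 -> the value in the config files
print(per_gpu_batch_size(2, 8))  # 64  -> 2 nodes x 8 GPUs
```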
I hope this clarifies things.
Thanks for the clarification. I will try to reproduce your numbers. Great work!
Hi @alihassanijr, thanks for the great repository. To reproduce your results, how many nodes were used to train these models? I see that config files are provided for each model, but I wonder whether any changes are needed when training on multiple nodes.