For all learners, the actual batch size, i.e. the number of training samples used for gradient computation and averaging, equals FLAGS.batch_size times the number of GPUs. If you find any inconsistent implementation, please let us know so that we can fix it.
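In other words (a minimal sketch, not the repository's actual code; the variable names here are hypothetical), synchronized SGD averages the per-GPU gradients, so a single update step reflects FLAGS.batch_size samples from each of the nb_gpus workers:

```python
import numpy as np

# Hypothetical sketch of why the effective batch size is
# FLAGS.batch_size * nb_gpus under synchronized SGD.
batch_size = 64   # FLAGS.batch_size: per-GPU batch size from the command line
nb_gpus = 4

# Each GPU computes a gradient over its own batch_size samples ...
per_gpu_grads = [np.random.randn(10) for _ in range(nb_gpus)]

# ... and synchronized SGD averages them, so one update step reflects
# batch_size * nb_gpus samples in total.
avg_grad = np.mean(per_gpu_grads, axis=0)
effective_batch_size = batch_size * nb_gpus  # 256
```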
While tuning distributed training performance and trying to get deeper insight into the statistics, I found that batch_size is a critical variable. The code shows that the full_precision learner uses batch_size from the command line directly, which means the total batch size is nb_gpus * batch_size for synchronized SGD. However, in the other learners, batch_size is multiplied by nb_gpus and then used to carefully set some hyper-parameters. To clarify this key point, could the community give some more high-level comments on 'batch_size'?
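For example, if the hyper-parameter being set were the learning rate, a common convention is the linear scaling rule (cf. Goyal et al., 2017). This is only a sketch of my understanding with hypothetical names, not the actual code:

```python
# Assumption: the hyper-parameter scaled by nb_gpus * batch_size is the
# learning rate, under the linear scaling rule. Names are hypothetical.
base_lr = 0.01            # learning rate tuned for the reference batch size
base_batch_size = 128     # reference batch size that base_lr was tuned for
batch_size_per_gpu = 64   # FLAGS.batch_size from the command line
nb_gpus = 8

# The effective batch size under synchronized SGD:
effective_batch_size = batch_size_per_gpu * nb_gpus  # 512

# Linear scaling: the learning rate grows proportionally with the
# effective batch size, keeping the per-sample update magnitude constant.
scaled_lr = base_lr * effective_batch_size / base_batch_size  # 0.04
```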