Closed MannyKayy closed 5 years ago
Please note that the actual batch size in data parallelism is (number of nodes) * (batch size in each node); in the MNIST example, --batchsize is the batch size in each node. Also, the MNIST training set has 60000 samples. For example, -b 16384 with -np 28 gives an actual batch size of 458752. That is more than 7 times the size of the MNIST dataset, so a single iteration covers more than 7 epochs. For now, Chainer assumes the batch size is much smaller than the dataset size, which I believe holds for most SGD training; this could be improved in the future.
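The arithmetic in the explanation above can be sketched as follows. This is not ChainerMN code, just a minimal illustration of "effective batch size = nodes × per-node batch size" against MNIST's 60000 training samples:

```python
# Sketch of the data-parallel batch-size arithmetic (not ChainerMN internals).
DATASET_SIZE = 60000  # MNIST training samples

def effective_batch_size(per_node_batch: int, num_nodes: int) -> int:
    """Samples consumed per iteration across all workers."""
    return per_node_batch * num_nodes

def epochs_per_iteration(per_node_batch: int, num_nodes: int,
                         dataset_size: int = DATASET_SIZE) -> float:
    """How many passes over the dataset a single iteration represents."""
    return effective_batch_size(per_node_batch, num_nodes) / dataset_size

# The example from the comment: -b 16384 on 28 workers.
print(effective_batch_size(16384, 28))   # 458752
print(epochs_per_iteration(16384, 28))   # ~7.65, i.e. >7 epochs per iteration
```

When this ratio exceeds 1.0, every single iteration already constitutes more than one full pass over the data, which is outside the regime the example's epoch accounting was written for.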
I see, alright. Thanks for clearing that up.
ChainerMN reports an incorrect number of completed epochs when dealing with large batch sizes or a large number of GPUs. I noticed this issue with the ChainerMN MNIST example code: the larger the batch size and the more GPUs used, the lower the number of completed epochs.
Batch size: 2048 GPU Count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 2048 -e 20 --communicator pure_nccl
Batch size: 2048 GPU Count: 32
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 32 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 2048 -e 20 --communicator pure_nccl
Batch size: 4096 GPU Count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 4096 -e 20 --communicator pure_nccl
Batch size: 8192 GPU Count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 8192 -e 20 --communicator pure_nccl
Batch size: 16384 GPU Count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 16384 -e 20 --communicator pure_nccl
Increasing the number of GPUs makes the problem worse. Batch size: 16384 GPU Count: 28
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 28 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 16384 -e 20 --communicator pure_nccl
Decreasing the GPU count helps with this problem. Batch size: 16384 GPU Count: 4
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 4 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 16384 -e 20 --communicator pure_nccl
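Plugging each configuration above into the effective-batch-size arithmetic (GPU count × per-node batch, against MNIST's 60000 training samples) shows which runs exceed a full epoch per iteration; this is a quick sanity check, not ChainerMN code:

```python
# Effective batch size for each reported configuration (per-node batch, GPUs).
# Ratios above 1.0 mean one iteration already covers more than one epoch.
DATASET_SIZE = 60000  # MNIST training samples

configs = [(2048, 16), (2048, 32), (4096, 16),
           (8192, 16), (16384, 16), (16384, 28), (16384, 4)]

for batch, gpus in configs:
    effective = batch * gpus
    print(f"-b {batch:5d} -np {gpus:2d}: effective batch {effective:6d}"
          f" = {effective / DATASET_SIZE:.2f} epochs per iteration")
```

All configurations except -b 2048 -np 16 have an effective batch size larger than the dataset itself, which lines up with the observation that bigger batches and more GPUs make the reported epoch count drift further.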