chainer / chainer

A flexible framework of neural networks for deep learning
https://chainer.org
MIT License

[chainermn] - incorrect epoch count on large batchsizes or large number of GPUs #5622

Closed. MannyKayy closed this issue 5 years ago.

MannyKayy commented 5 years ago

chainermn reports an incorrect number of epochs when dealing with large batch sizes or a large number of GPUs. I noticed this issue with the chainermn MNIST example code. The larger the batch size and the more GPUs used, the lower the number of completed epochs.

Batch size: 2048, GPU count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 2048 -e 20 --communicator pure_nccl

[screenshot of training log]

Batch size: 2048, GPU count: 32
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 32 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 2048 -e 20 --communicator pure_nccl

[screenshot of training log]

Batch size: 4096, GPU count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 4096 -e 20 --communicator pure_nccl

[screenshot of training log]

Batch size: 8192, GPU count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 8192 -e 20 --communicator pure_nccl

[screenshot of training log]

Batch size: 16384, GPU count: 16
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 16 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 16384 -e 20 --communicator pure_nccl

[screenshot of training log]

Increasing the number of GPUs makes the problem worse.

Batch size: 16384, GPU count: 28
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 28 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 16384 -e 20 --communicator pure_nccl

[screenshot of training log]

Decreasing the GPU count helps with this problem.

Batch size: 16384, GPU count: 4
mpiexec -x LD_LIBRARY_PATH -x LD_RUN_PATH -x PATH --hostfile ~/hostfile_b -np 4 python ~/chainermn/examples/mnist/train_mnist.py --gpu -b 16384 -e 20 --communicator pure_nccl

[screenshot of training log]

kuenishi commented 5 years ago

Please note that the actual batch size in data parallelism is (number of nodes) * (batch size on each node); in the MNIST example, --batchsize is the batch size on each node. Also, the MNIST training set contains 60000 examples. For example, -b 16384 with -np 28 gives an actual batch size of 458752. That is more than seven times the size of the MNIST dataset, so a single iteration covers multiple epochs' worth of data. For now, Chainer assumes the batch size is much smaller than the dataset size, which I believe is the common case for SGD training; this could be improved in the future.
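To make the arithmetic concrete, here is a minimal sketch (plain Python, not tied to any Chainer or chainermn API) of the global batch size and the implied epochs per iteration under this explanation; the batch-size/worker-count pairs are taken from the commands above.

```python
# Rough arithmetic sketch of the effective (global) batch size in data-parallel
# training, assuming each of the N workers processes a local batch of
# `batch_per_node` examples per iteration.

MNIST_TRAIN_SIZE = 60000  # size of the MNIST training set


def effective_batch(batch_per_node, num_workers):
    """Global batch size when every worker processes its own local batch."""
    return batch_per_node * num_workers


def epochs_per_iteration(batch_per_node, num_workers, dataset_size=MNIST_TRAIN_SIZE):
    """How many passes over the full dataset a single iteration consumes."""
    return effective_batch(batch_per_node, num_workers) / dataset_size


if __name__ == "__main__":
    # Pairs mirror the runs reported in this issue.
    for b, n in [(2048, 16), (16384, 16), (16384, 28), (16384, 4)]:
        print(f"-b {b:>5} -np {n:>2}: global batch = {effective_batch(b, n):>6}, "
              f"epochs per iteration ~ {epochs_per_iteration(b, n):.2f}")
    # For -b 16384 -np 28 the global batch is 458752, i.e. roughly 7.6 epochs'
    # worth of data per iteration, which is why the reported epoch count no
    # longer matches the expectation that one iteration is a small fraction
    # of an epoch.
```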

MannyKayy commented 5 years ago

I see, alright. Thanks for clearing that up.