BERT_BASE and BERT_LARGE contain 110M and 340M parameters, respectively. Currently, multi-GPU scaling is poor for this model, and profiling shows large overhead from cross-GPU ndarray copies.
The default kvstore push/pull does not leverage the communication pattern on the machine (e.g. an AWS p3 instance). It would be great to use the experimental tree reduce push/pull introduced by @ctcyang.
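For reference, tree reduce is switched on via environment variables before launching training (the training command itself is illustrative, not part of the report):

```shell
export MXNET_KVSTORE_USETREE=1   # switch kvstore push/pull to the experimental tree reduce
export MXNET_KVSTORE_LOGTREE=1   # print the GPU link weight matrix and generated trees
# python run_pretraining.py ...  # then launch training as usual
```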
However, the following error occurs with MXNET_KVSTORE_LOGTREE=1 MXNET_KVSTORE_USETREE=1:
[07:02:29] src/kvstore/./././gpu_topology.h:60: Weight:
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 2 0 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 2 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 0 0 0 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 0 0 0 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 0 2 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 2 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 2 0 2 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 2 0 2 0 0
[06:50:46] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[06:50:46] src/kvstore/./././gpu_topology.h:1030: No valid binary tree found from root 0, try backtracking
Traceback (most recent call last):
File "run_pretraining.py", line 257, in <module>
train()
File "run_pretraining.py", line 228, in train
trainer.step(1)
File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 290, in step
self._allreduce_grads()
File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 320, in _allreduce_grads
self._kvstore.push(i, param.list_grad(), priority=-i)
File "/home/ubuntu/mxnet/python/mxnet/kvstore.py", line 237, in push
self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
File "/home/ubuntu/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:50:46] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 0 using backtracking
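As a quick sanity check (not part of the original report), the weight matrix logged above can be transcribed and inspected. It is symmetric, but every GPU reports only two direct NVLink neighbors, i.e. a sparse ring-like topology, which may be why Kernighan-Lin cannot find a valid binary tree from root 0. A sketch:

```python
# The 8x8 link weight matrix printed by gpu_topology.h above
# (rows/cols are GPU ids; 2 = direct NVLink weight, 0 = no direct link).
weight = [
    [0, 2, 2, 0, 0, 0, 0, 0],
    [2, 0, 0, 2, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 2, 0],
    [0, 2, 0, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0, 2, 2, 0],
    [0, 0, 0, 0, 2, 0, 0, 2],
    [0, 0, 2, 0, 2, 0, 0, 0],
    [0, 0, 0, 2, 0, 2, 0, 0],
]

# Sanity checks: the matrix is symmetric, and each GPU has exactly
# two direct neighbors (degree 2), a sparse topology for tree building.
assert all(weight[i][j] == weight[j][i] for i in range(8) for j in range(8))
print([sum(1 for w in row if w) for row in weight])  # -> [2, 2, 2, 2, 2, 2, 2, 2]
```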