BERT_BASE and BERT_LARGE contain 110M and 340M parameters, respectively. Currently, multi-GPU scaling is poor for this model, and profiling shows large overhead from cross-GPU ndarray copies.
The default kvstore push/pull does not leverage the communication pattern on the machine (e.g. an AWS p3 instance). It would be great to use the experimental tree reduce push/pull introduced by @ctcyang.
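For reference, tree reduce is switched on via environment variables before launching training (the training command itself is illustrative, not part of the report):

```shell
export MXNET_KVSTORE_USETREE=1   # switch kvstore push/pull to the experimental tree reduce
export MXNET_KVSTORE_LOGTREE=1   # print the GPU link weight matrix and generated trees
# python run_pretraining.py ...  # then launch training as usual
```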
However, the following error occurs with MXNET_KVSTORE_LOGTREE=1 MXNET_KVSTORE_USETREE=1:
[07:02:29] src/kvstore/./././gpu_topology.h:60: Weight:
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 2 0 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 2 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 0 0 0 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 0 0 0 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 0 2 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 2 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 2 0 2 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 2 0 2 0 0
[06:50:46] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[06:50:46] src/kvstore/./././gpu_topology.h:1030: No valid binary tree found from root 0, try backtracking
Traceback (most recent call last):
File "run_pretraining.py", line 257, in <module>
train()
File "run_pretraining.py", line 228, in train
trainer.step(1)
File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 290, in step
self._allreduce_grads()
File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 320, in _allreduce_grads
self._kvstore.push(i, param.list_grad(), priority=-i)
File "/home/ubuntu/mxnet/python/mxnet/kvstore.py", line 237, in push
self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
File "/home/ubuntu/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:50:46] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 0 using backtracking
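As a quick sanity check (not part of the original report), the weight matrix logged above can be transcribed and inspected. It is symmetric, but every GPU reports only two direct NVLink neighbors, i.e. a sparse ring-like topology, which may be why Kernighan-Lin cannot find a valid binary tree from root 0. A sketch:

```python
# The 8x8 link weight matrix printed by gpu_topology.h above
# (rows/cols are GPU ids; 2 = direct NVLink weight, 0 = no direct link).
weight = [
    [0, 2, 2, 0, 0, 0, 0, 0],
    [2, 0, 0, 2, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 2, 0],
    [0, 2, 0, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0, 2, 2, 0],
    [0, 0, 0, 0, 2, 0, 0, 2],
    [0, 0, 2, 0, 2, 0, 0, 0],
    [0, 0, 0, 2, 0, 2, 0, 0],
]

# Sanity checks: the matrix is symmetric, and each GPU has exactly
# two direct neighbors (degree 2), a sparse topology for tree building.
assert all(weight[i][j] == weight[j][i] for i in range(8) for j in range(8))
print([sum(1 for w in row if w) for row in weight])  # -> [2, 2, 2, 2, 2, 2, 2, 2]
```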