riyijiye opened this issue 5 years ago
Can you attach the complete log, please?
Which container do you use?
nvcr.io/nvidia/tensorflow:19.05-py3. It should not be a container issue. When running training with 4 GPUs I do not see the problem; the issue only appears with 16 GPUs.
Strange, I see only 4 workers in the log you attached.
Oh, that log is for 4 GPUs. With 4 GPUs it sometimes ends up with this issue and sometimes works without problems, but with 16 GPUs it never works.
So you are running training on 4 nodes with identical GPUs and the same driver version?
I am not 100% sure about this; I need to ask our IT support. So Horovod only works with identical GPUs and the same driver version, right?
A container can require a certain driver version, so if a node has the wrong driver version, the container can fail on that node.
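(Not from the thread, just a suggestion:) to rule this out, a small script like the sketch below can be run on each of the 4 nodes. It only assumes nvidia-smi is on the PATH, and prints each node's hostname, GPU model, and driver version so mismatches are easy to spot.

```python
# Minimal check (illustrative): print this node's GPU model and driver version.
import socket
import subprocess

def gpu_info():
    # nvidia-smi ships with the driver; this query format prints one
    # "name, driver_version" line per GPU on the node.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
    )
    return out.decode().strip()

if __name__ == "__main__":
    print(socket.gethostname())
    print(gpu_info())
```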
The container nvcr.io/nvidia/tensorflow:19.05-py3 I am using is based on TensorFlow 1.13 and CUDA 10. In the list of containers here https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow, which one is for TensorFlow 1.12 and CUDA 9?
Can you try 19.02 and send me a log, please?
19.02 has some issue, so I pulled 19.01-py3 instead; it is based on TensorFlow 1.12 and CUDA 10. I still need to find a container with CUDA 9.0 to try.
Running distributed training gives the issue below:
1462.worker.1 | [2019-06-04T14:59:47Z] WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 1 vs previous value: 1. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
Any idea? Thanks!
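That warning usually just means the global step tensor is never passed to the optimizer, so tf.train.get_global_step() stays at its initial value. Below is a minimal TF 1.x + Horovod sketch of the wiring the warning asks for (the toy model and names are my own, not the reporter's code); the key line is passing global_step to minimize().

```python
# Minimal sketch (assumptions: TF 1.x, Horovod installed, launched with
# horovodrun/mpirun). Fit y = 2x with a single weight.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[2.0], [4.0], [6.0]])
w = tf.Variable([[0.0]])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

global_step = tf.train.get_or_create_global_step()
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Passing global_step here is what increments tf.train.get_global_step();
# omitting it is a common cause of the "global step has not been increased"
# warning from MonitoredSession/Estimator hooks.
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(10):
        sess.run(train_op)
```

If the global step is already wired in like this and the warning still appears only on multi-node runs, it is more likely a sign that some workers are stalled (e.g. the driver/container mismatch discussed above) than a modeling problem.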