riyijiye opened this issue 5 years ago
Can you attach the complete log, please?
Which container do you use?
nvcr.io/nvidia/tensorflow:19.05-py3. It should not be a container issue. When running training with 4 GPUs I do not see the problem; the issue only appears with 16 GPUs.
Strange, I see only 4 workers in the log you attached.
Oh, that log is for 4 GPUs. With 4 GPUs it sometimes ends up with this issue and sometimes works without problems, but with 16 GPUs it never works.
So you are running training on 4 nodes with identical GPUs and the same driver version?
I am not 100% sure about this; I need to ask our IT support. So Horovod only works with identical GPUs and the same driver version, right?
A container can require a certain driver version, so if a node has the wrong driver version, the container can fail on that node.
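(Not from the thread, just a suggestion:) to rule this out, a small script like the sketch below can be run on each of the 4 nodes. It only assumes nvidia-smi is on the PATH, and prints each node's hostname, GPU model, and driver version so mismatches are easy to spot.

```python
# Minimal check (illustrative): print this node's GPU model and driver version.
import socket
import subprocess

def gpu_info():
    # nvidia-smi ships with the driver; this query format prints one
    # "name, driver_version" line per GPU on the node.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
    )
    return out.decode().strip()

if __name__ == "__main__":
    print(socket.gethostname())
    print(gpu_info())
```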
The container nvcr.io/nvidia/tensorflow:19.05-py3 I am using is based on TensorFlow 1.13 and CUDA 10. In the list of containers here https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow, which one is for TensorFlow 1.12 and CUDA 9?
Can you try 19.02 and send me a log, please?
19.02 has some issue, so I pulled 19.01-py3 instead; it is based on TensorFlow 1.12 and CUDA 10. I still need to find a container with CUDA 9.0 to try.
Running distributed training gives the issue below:
1462.worker.1 | [2019-06-04T14:59:47Z] WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 1 vs previous value: 1. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
Any idea? Thanks!
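That warning usually just means the global step tensor is never passed to the optimizer, so tf.train.get_global_step() stays at its initial value. Below is a minimal TF 1.x + Horovod sketch of the wiring the warning asks for (the toy model and names are my own, not the reporter's code); the key line is passing global_step to minimize().

```python
# Minimal sketch (assumptions: TF 1.x, Horovod installed, launched with
# horovodrun/mpirun). Fit y = 2x with a single weight.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[2.0], [4.0], [6.0]])
w = tf.Variable([[0.0]])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

global_step = tf.train.get_or_create_global_step()
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Passing global_step here is what increments tf.train.get_global_step();
# omitting it is a common cause of the "global step has not been increased"
# warning from MonitoredSession/Estimator hooks.
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(10):
        sess.run(train_op)
```

If the global step is already wired in like this and the warning still appears only on multi-node runs, it is more likely a sign that some workers are stalled (e.g. the driver/container mismatch discussed above) than a modeling problem.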