riyijiye opened this issue 5 years ago
Can you run nvidia-smi topo -m -p2p n, please?
The jobs are submitted to a cluster of servers, with workers occupying different GPU cards on different servers, so it is hard for me to run this.
The issue is actually coupled with the one in another ticket: https://github.com/NVIDIA/OpenSeq2Seq/issues/455
After adding code to run.py to execute the nvidia-smi command inside each worker, I actually see the output below (taking worker 1 as an example). As stated in ticket https://github.com/NVIDIA/OpenSeq2Seq/issues/455, training with 4 GPUs works.
1482.worker.1 | [2019-06-05T13:51:52Z] *** Using horovod
1482.worker.1 | [2019-06-05T13:51:52Z]       GPU0
1482.worker.1 | [2019-06-05T13:51:52Z] GPU0   X
1482.worker.1 | [2019-06-05T13:51:52Z]
1482.worker.1 | [2019-06-05T13:51:52Z] Legend:
1482.worker.1 | [2019-06-05T13:51:52Z]
1482.worker.1 | [2019-06-05T13:51:52Z]   X   = Self
1482.worker.1 | [2019-06-05T13:51:52Z]   OK  = Status Ok
1482.worker.1 | [2019-06-05T13:51:52Z]   CNS = Chipset not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   GNS = GPU not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   TNS = Topology not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   NS  = Not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   U   = Unknown
1482.worker.1 | [2019-06-05T13:51:52Z]       GPU0 CPU Affinity
1482.worker.1 | [2019-06-05T13:51:52Z] Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/BSPCZF4EIQG6QXDPEJGQFVCTQC:/var/lib/docker/overlay2/l/ALGYUTHK56DKUYFRTBGRSRVPG4:/var/lib/docker/overlay2/l/44NWXYBQFCHUE34JYSIOFA2EGR:/var/lib/docker/overlay2/l/7IJL3F2TBTLOD4EVFCLRJOYLEB:/var/lib/docker/overlay2/l/4YCRYWXVAZQQPICACQPWDJWLJR:/var/lib/docker/overlay2/l/LOJ2F4KGP2CRU6J5PT7MS45G32:/var/lib/docker/overlay2/l/IGV3ZY7LGFNRWQLKIROBEBRNQY:/var/lib/docker/overlay2/l/GR63ZUK2KO6EV5HMEKOITZWO6E:/var/lib/docker/overlay2/l/SHS5Y7N6TPDAO'
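For reference, a minimal sketch of the kind of debug hook described above, i.e. running nvidia-smi topo -m -p2p n from inside run.py so that each Horovod worker logs the topology it sees. The placement inside run.py and the rank-prefixed print are assumptions, not the actual patch:

```python
# Hypothetical debug hook (not the actual patch): each Horovod worker runs
# `nvidia-smi topo -m -p2p n` and prints the topology it sees, tagged with its rank.
import subprocess
import horovod.tensorflow as hvd

hvd.init()
topo = subprocess.check_output(["nvidia-smi", "topo", "-m", "-p2p", "n"]).decode()
print("worker rank {}:\n{}".format(hvd.rank(), topo))
```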
Do you run the job on one node with 16 GPUs or on multiple nodes?
multiple nodes
Can you check whether you can successfully run these NCCL tests (https://github.com/nvidia/nccl-tests) on those machines, please?
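For context, a hypothetical way to launch one of those tests (all_reduce_perf) across two nodes from Python. The hostnames, slot counts, and build path are placeholders; the -b/-e/-f/-g flags follow the nccl-tests README, and the binary must first be built with MPI support in the nccl-tests repo:

```python
# Hypothetical launcher for the nccl-tests all_reduce_perf binary across two nodes.
# host1/host2 and the ./build path are placeholders; build the binary first
# (e.g. `make MPI=1` in the nccl-tests repo) and adjust paths to your cluster.
import subprocess

cmd = [
    "mpirun", "-np", "16",
    "-H", "host1:8,host2:8",      # placeholder hosts, 8 slots each
    "./build/all_reduce_perf",
    "-b", "8",                    # minimum message size in bytes
    "-e", "128M",                 # maximum message size
    "-f", "2",                    # multiply message size by 2 between steps
    "-g", "1",                    # one GPU per MPI rank
]
subprocess.run(cmd, check=True)
```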
In my training I saw the messages below, but I am not sure of their impact. Can anyone help explain?
1469.worker.1 | [2019-06-04T16:49:20Z] [2019-06-04 16:49:19.737850: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
1469.worker.1 | [2019-06-04T16:49:20Z] Stalled ops:Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_1_0 [missing ranks: 0]
1469.worker.1 | [2019-06-04T16:49:20Z] [2019-06-04 16:49:19.738018: W horovod/common/operations.cc:588] Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0 [missing ranks: 0]
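The warning text itself describes the cause: an allreduce was submitted by only a subset of ranks, and the remaining rank (rank 0, per "missing ranks: 0") never submitted the matching tensor, so the collective stalls and will eventually deadlock. A toy illustration of that pattern, assuming Horovod's TensorFlow API with eager execution (not taken from OpenSeq2Seq):

```python
# Toy illustration (not OpenSeq2Seq code): only ranks != 0 submit the allreduce,
# so they block waiting for rank 0, and after ~60 seconds Horovod prints the
# "missing ranks: 0" stall warning shown above.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

x = tf.constant(1.0)
if hvd.rank() != 0:
    # Every rank except 0 joins the collective; rank 0 never does.
    y = hvd.allreduce(x)
    print("rank {} got {}".format(hvd.rank(), y.numpy()))
```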