riyijiye opened this issue 5 years ago
Can you run nvidia-smi topo -m -p2p n, please?
The jobs are submitted to a cluster of servers, with workers occupying different GPU cards on different servers, so it is hard for me to run this.
The issue is actually coupled with the one in another ticket: https://github.com/NVIDIA/OpenSeq2Seq/issues/455
After adding code to run.py to execute the nvidia-smi command inside each worker, I actually see the output below (taking worker 1 as an example). As stated in ticket https://github.com/NVIDIA/OpenSeq2Seq/issues/455, training with 4 GPUs works.
1482.worker.1 | [2019-06-05T13:51:52Z] *** Using horovod
1482.worker.1 | [2019-06-05T13:51:52Z]       GPU0
1482.worker.1 | [2019-06-05T13:51:52Z] GPU0   X
1482.worker.1 | [2019-06-05T13:51:52Z]
1482.worker.1 | [2019-06-05T13:51:52Z] Legend:
1482.worker.1 | [2019-06-05T13:51:52Z]
1482.worker.1 | [2019-06-05T13:51:52Z]   X   = Self
1482.worker.1 | [2019-06-05T13:51:52Z]   OK  = Status Ok
1482.worker.1 | [2019-06-05T13:51:52Z]   CNS = Chipset not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   GNS = GPU not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   TNS = Topology not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   NS  = Not supported
1482.worker.1 | [2019-06-05T13:51:52Z]   U   = Unknown
1482.worker.1 | [2019-06-05T13:51:52Z]       GPU0 CPU Affinity
1482.worker.1 | [2019-06-05T13:51:52Z] Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/BSPCZF4EIQG6QXDPEJGQFVCTQC:/var/lib/docker/overlay2/l/ALGYUTHK56DKUYFRTBGRSRVPG4:/var/lib/docker/overlay2/l/44NWXYBQFCHUE34JYSIOFA2EGR:/var/lib/docker/overlay2/l/7IJL3F2TBTLOD4EVFCLRJOYLEB:/var/lib/docker/overlay2/l/4YCRYWXVAZQQPICACQPWDJWLJR:/var/lib/docker/overlay2/l/LOJ2F4KGP2CRU6J5PT7MS45G32:/var/lib/docker/overlay2/l/IGV3ZY7LGFNRWQLKIROBEBRNQY:/var/lib/docker/overlay2/l/GR63ZUK2KO6EV5HMEKOITZWO6E:/var/lib/docker/overlay2/l/SHS5Y7N6TPDAO'
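For reference, a minimal sketch of the kind of debug hook described above, i.e. running nvidia-smi topo -m -p2p n from inside run.py so that each Horovod worker logs the topology it sees. The placement inside run.py and the rank-prefixed print are assumptions, not the actual patch:

```python
# Hypothetical debug hook (not the actual patch): each Horovod worker runs
# `nvidia-smi topo -m -p2p n` and prints the topology it sees, tagged with its rank.
import subprocess
import horovod.tensorflow as hvd

hvd.init()
topo = subprocess.check_output(["nvidia-smi", "topo", "-m", "-p2p", "n"]).decode()
print("worker rank {}:\n{}".format(hvd.rank(), topo))
```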
Do you run the job on one node with 16 GPUs or on multiple nodes?
multiple nodes
Can you check whether you can successfully run these NCCL tests (https://github.com/nvidia/nccl-tests) on those machines, please?
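For context, a hypothetical way to launch one of those tests (all_reduce_perf) across two nodes from Python. The hostnames, slot counts, and build path are placeholders; the -b/-e/-f/-g flags follow the nccl-tests README, and the binary must first be built with MPI support in the nccl-tests repo:

```python
# Hypothetical launcher for the nccl-tests all_reduce_perf binary across two nodes.
# host1/host2 and the ./build path are placeholders; build the binary first
# (e.g. `make MPI=1` in the nccl-tests repo) and adjust paths to your cluster.
import subprocess

cmd = [
    "mpirun", "-np", "16",
    "-H", "host1:8,host2:8",      # placeholder hosts, 8 slots each
    "./build/all_reduce_perf",
    "-b", "8",                    # minimum message size in bytes
    "-e", "128M",                 # maximum message size
    "-f", "2",                    # multiply message size by 2 between steps
    "-g", "1",                    # one GPU per MPI rank
]
subprocess.run(cmd, check=True)
```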
In my training I saw the messages below, but I am not sure of their impact. Can anyone help explain?
1469.worker.1 | [2019-06-04T16:49:20Z] [2019-06-04 16:49:19.737850: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
1469.worker.1 | [2019-06-04T16:49:20Z] Stalled ops:Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_1_0 [missing ranks: 0]
1469.worker.1 | [2019-06-04T16:49:20Z] [2019-06-04 16:49:19.738018: W horovod/common/operations.cc:588] Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0 [missing ranks: 0]
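The warning text itself describes the cause: an allreduce was submitted by only a subset of ranks, and the remaining rank (rank 0, per "missing ranks: 0") never submitted the matching tensor, so the collective stalls and will eventually deadlock. A toy illustration of that pattern, assuming Horovod's TensorFlow API with eager execution (not taken from OpenSeq2Seq):

```python
# Toy illustration (not OpenSeq2Seq code): only ranks != 0 submit the allreduce,
# so they block waiting for rank 0, and after ~60 seconds Horovod prints the
# "missing ranks: 0" stall warning shown above.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

x = tf.constant(1.0)
if hvd.rank() != 0:
    # Every rank except 0 joins the collective; rank 0 never does.
    y = hvd.allreduce(x)
    print("rank {} got {}".format(hvd.rank(), y.numpy()))
```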