Open liaocz opened 5 years ago
Which environment are you using? Horovod in the native environment, or via a docker image?
native environment
You may need to check whether Horovod's network options are set properly ("eth1" parts), according to your native environment's network configuration.
options="-np ${nb_gpus} -H localhost:${nb_gpus} -bind-to none -map-by slot
-x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth1 -x NCCL_IB_DISABLE=1
-x LD_LIBRARY_PATH --mca btl_tcp_if_include eth1"
mpirun ${options} python main.py --enbl_multi_gpu ${extra_args}
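Before launching, it may help to confirm that the interface named in these options actually exists on the machine. The following is a minimal sketch using only Python's standard library; the helper names `host_interfaces` and `missing_interfaces` are hypothetical and not part of Horovod, NCCL, or PocketFlow, and "eth1" is simply the name taken from the options above.

```python
import socket


def host_interfaces():
    """Return the names of the network interfaces visible on this host."""
    # socket.if_nameindex() yields (index, name) pairs for local interfaces.
    return [name for _, name in socket.if_nameindex()]


def missing_interfaces(requested, available):
    """Return the requested interface names that are absent from `available`."""
    present = set(available)
    return [name for name in requested if name not in present]


if __name__ == "__main__":
    # "eth1" is the interface passed to NCCL_SOCKET_IFNAME and
    # btl_tcp_if_include in the mpirun options; adjust it to your setup.
    requested = ["eth1"]
    available = host_interfaces()
    missing = missing_interfaces(requested, available)
    if missing:
        print("Not found on this host:", missing, "| available:", available)
    else:
        print("All requested interfaces exist:", requested)
```

If the script reports the interface as missing, replacing "eth1" in both places with one of the listed names is a reasonable first thing to try.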
Thank you for your answer. I have exported the environment variable NCCL_SOCKET_IFNAME=eth1, but it is not working for me: it still hangs when I use 2 GPUs on one node. If I comment out the bcast, it continues running. Do you have any idea?
Hi liaocz, could you paste the log file so we can help figure out the root cause?
2019-01-17 11:30:31.451159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-01-17 11:30:31.451169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-01-17 11:30:31.451695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d/kernel:0 of size (7, 7, 3, 64)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_1/kernel:0 of size (1, 1, 64, 64)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_2/kernel:0 of size (3, 3, 64, 64)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_3/kernel:0 of size (3, 3, 64, 64)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_4/kernel:0 of size (3, 3, 64, 64)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_5/kernel:0 of size (3, 3, 64, 64)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_6/kernel:0 of size (1, 1, 64, 128)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_7/kernel:0 of size (3, 3, 64, 128)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_8/kernel:0 of size (3, 3, 128, 128)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_9/kernel:0 of size (3, 3, 128, 128)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_10/kernel:0 of size (3, 3, 128, 128)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_11/kernel:0 of size (1, 1, 128, 256)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_12/kernel:0 of size (3, 3, 128, 256)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_13/kernel:0 of size (3, 3, 256, 256)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_14/kernel:0 of size (3, 3, 256, 256)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_15/kernel:0 of size (3, 3, 256, 256)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_16/kernel:0 of size (1, 1, 256, 512)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_17/kernel:0 of size (3, 3, 256, 512)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_18/kernel:0 of size (3, 3, 512, 512)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_19/kernel:0 of size (3, 3, 512, 512)
INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_20/kernel:0 of size (3, 3, 512, 512)
2019-01-17 11:30:38.546495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-01-17 11:30:38.546564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-17 11:30:38.546575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-01-17 11:30:38.546582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-01-17 11:30:38.546796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
INFO:tensorflow:begin restoring model from checkpoint file
INFO:tensorflow:/mnt/PocketFlow/pretrain_models/models
INFO:tensorflow:/mnt/PocketFlow/pretrain_models/models/model.ckpt-250227
INFO:tensorflow:Restoring parameters from /mnt/PocketFlow/pretrain_models/models/model.ckpt-250227
INFO:tensorflow:finish restoring model from checkpoint file
INFO:tensorflow:name: "group_deps"
It hangs right after it finishes restoring the checkpoint file.
@liaocz We do not have a clue at the moment. This looks more like a Horovod-related issue. Maybe you can find some help here: https://github.com/uber/horovod
When I use one GPU, it finishes without any problem, but with multiple GPUs it hangs when running the bcast operation. I don't know how to solve it. Code: channel_pruning_gpu/learner.py:149
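One generic way to confirm where each process is stuck is to have Python dump its stack traces if a step takes too long. The sketch below uses the standard `faulthandler` module; `hang_watchdog` and the timeout value are hypothetical names chosen for illustration, not part of PocketFlow or Horovod, and the dump only shows Python frames, so a stall inside native communication code would appear as the Python call that entered it.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=120.0, file=None):
    """Dump the tracebacks of all threads every `timeout_s` seconds while
    the wrapped block is still running, then disarm the timer on exit."""
    target = file if file is not None else sys.stderr
    # `target` needs a real file descriptor and must stay open while armed.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the suspected call (assumed names):
#
#     with hang_watchdog(timeout_s=120.0):
#         sess.run(bcast_op)
```

Running each rank with this wrapper around the broadcast step should show whether all processes reach the same call or whether one of them never gets there.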