Tencent / PocketFlow

An Automatic Model Compression (AutoMC) framework for developing smaller and faster AI applications.
https://pocketflow.github.io

Can't run with multiple GPUs #176

Open liaocz opened 5 years ago

liaocz commented 5 years ago

With one GPU the run finishes without any problem, but with multiple GPUs it hangs while running the bcast operation. I don't know how to solve it. Code: channel_pruning_gpu/learner.py:149

jiaxiang-wu commented 5 years ago

Which environment are you using? Horovod in the native environment, or via a docker image?

liaocz commented 5 years ago

native environment

jiaxiang-wu commented 5 years ago

You may need to check whether Horovod's network options are set properly ("eth1" parts), according to your native environment's network configuration.

  options="-np ${nb_gpus} -H localhost:${nb_gpus} -bind-to none -map-by slot
      -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth1 -x NCCL_IB_DISABLE=1
      -x LD_LIBRARY_PATH --mca btl_tcp_if_include eth1"
  mpirun ${options} python main.py --enbl_multi_gpu ${extra_args}
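
Before editing the "eth1" parts, it may help to confirm which interfaces the host actually has. The following is a minimal sketch, not part of PocketFlow or Horovod: the helper names (`list_interfaces`, `pick_interface`, `network_flags`) are hypothetical, and the fallback rule (first interface that is not `lo`) is an assumption that may not suit every machine.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible on this host."""
    return [name for _, name in socket.if_nameindex()]


def pick_interface(preferred, available):
    """Return `preferred` if it exists, else the first non-loopback name, else None.

    The fallback is only a guess (an assumption of this sketch); check it
    against the machine's real network configuration.
    """
    if preferred in available:
        return preferred
    candidates = [name for name in available if name != "lo"]
    return candidates[0] if candidates else None


def network_flags(ifname):
    """Build the interface-dependent mpirun arguments quoted in this thread."""
    return [
        "-x", f"NCCL_SOCKET_IFNAME={ifname}",
        "--mca", "btl_tcp_if_include", ifname,
    ]


def main():
    available = list_interfaces()
    chosen = pick_interface("eth1", available)
    print("interfaces:", available)
    if chosen is None:
        print("no non-loopback interface found")
    else:
        print("mpirun flags:", " ".join(network_flags(chosen)))
```

Calling `main()` on the machine that runs `mpirun` prints the available interface names and the two flags rewritten for the chosen one; if "eth1" is not in the list, the options above are pointing at an interface that does not exist there.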

liaocz commented 5 years ago

Thank you for your answer. I have exported the environment variable NCCL_SOCKET_IFNAME=eth1, but it's not working for me: it still hangs when I use 2 GPUs on one node. If I comment out the bcast, it continues running. Do you have any idea?

jinhou commented 5 years ago

Hi liaocz, could you paste the log file so we can help figure out the root cause?
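
If the log simply stops at the hang, one way to get more out of it is Python's standard `faulthandler` module, which can dump the Python stack of every thread once a timeout expires. This is a minimal sketch under assumptions: `hang_watchdog` is a hypothetical helper, not something PocketFlow or Horovod provides, and it only shows the Python-level call that is blocked, not what happens inside native code.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s, stream=None):
    """Dump all Python thread stacks to `stream` if the block runs past `timeout_s`.

    `stream` must be a real file object (it needs a file descriptor);
    it defaults to sys.stderr. The process is not killed.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes in time.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the suspected call (the names below are placeholders):
#
#     with hang_watchdog(120):
#         sess.run(bcast_op)
```

Wrapping the suspected call on every rank and comparing the dumps shows whether all processes are blocked at the same line or whether one of them never reached it.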

liaocz commented 5 years ago

  2019-01-17 11:30:31.451159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
  2019-01-17 11:30:31.451169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
  2019-01-17 11:30:31.451695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d/kernel:0 of size (7, 7, 3, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_1/kernel:0 of size (1, 1, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_2/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_3/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_4/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_5/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_6/kernel:0 of size (1, 1, 64, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_7/kernel:0 of size (3, 3, 64, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_8/kernel:0 of size (3, 3, 128, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_9/kernel:0 of size (3, 3, 128, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_10/kernel:0 of size (3, 3, 128, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_11/kernel:0 of size (1, 1, 128, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_12/kernel:0 of size (3, 3, 128, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_13/kernel:0 of size (3, 3, 256, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_14/kernel:0 of size (3, 3, 256, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_15/kernel:0 of size (3, 3, 256, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_16/kernel:0 of size (1, 1, 256, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_17/kernel:0 of size (3, 3, 256, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_18/kernel:0 of size (3, 3, 512, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_19/kernel:0 of size (3, 3, 512, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_20/kernel:0 of size (3, 3, 512, 512)
  2019-01-17 11:30:38.546495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
  2019-01-17 11:30:38.546564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
  2019-01-17 11:30:38.546575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
  2019-01-17 11:30:38.546582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
  2019-01-17 11:30:38.546796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
  INFO:tensorflow:begin restoring model from checkpoint file
  INFO:tensorflow:/mnt/PocketFlow/pretrain_models/models
  INFO:tensorflow:/mnt/PocketFlow/pretrain_models/models/model.ckpt-250227
  INFO:tensorflow:Restoring parameters from /mnt/PocketFlow/pretrain_models/models/model.ckpt-250227
  INFO:tensorflow:finish restoring model from checkpoint file
  INFO:tensorflow:name: "group_deps"

It hangs right after it finishes restoring the checkpoint file.
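
When comparing logs from several processes, it can be useful to pull out which physical GPU each one attached to. A minimal sketch, assuming the "Created TensorFlow device" line keeps the format shown in the log above; `created_devices` is a hypothetical helper, and the regular expression was written against that one line only, so other TensorFlow versions may need adjustments.

```python
import re

# Matches the "Created TensorFlow device (...) -> physical GPU (...)" entries
# as they appear in the log pasted in this thread.
DEVICE_RE = re.compile(
    r"Created TensorFlow device \((?P<device>\S+) with \d+ MB memory\)"
    r" -> physical GPU \(device: \d+, name: (?P<name>[^,]+),"
    r" pci bus id: (?P<pci>[0-9a-fA-F:.]+)"
)


def created_devices(log_text):
    """Return (tf_device, gpu_name, pci_bus_id) for each matching log entry."""
    return [
        (m.group("device"), m.group("name"), m.group("pci"))
        for m in DEVICE_RE.finditer(log_text)
    ]


# One entry copied from the log above.
sample = (
    "2019-01-17 11:30:31.451695: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1097] Created TensorFlow device "
    "(/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) "
    "-> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, "
    "compute capability: 6.1)"
)

for device, name, pci in created_devices(sample):
    print(device, name, pci)
```

Running it over each process's log separately and comparing the PCI bus ids shows whether the processes ended up on different GPUs. The log above contains only entries for bus id 0000:83:00.0, so a log from the second process would be needed for that comparison.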

jiaxiang-wu commented 5 years ago

@liaocz We do not have a clue at the moment. This looks more like a Horovod-related issue. Maybe you can find some help here: https://github.com/uber/horovod