facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

Multi-GPU training process stuck randomly #219

Closed · liuyuisanai closed this issue 6 years ago

liuyuisanai commented 6 years ago

Expected results

Training process should run well.

Actual results

The multi-GPU training process gets stuck randomly, and one or more GPUs sit at 100% utilization with low power draw.

Detailed steps to reproduce

I am training RetinaNet with the R-50_1x configuration, with the sigmoid focal loss modified to a softmax focal loss. After several thousand iterations, the process gets stuck and the log shows:

......
json_stats: {"eta": "10:51:09", "fl_fpn3": 0.106909, "fl_fpn4": 0.110105, "fl_fpn5": 0.127377, "fl_fpn6": 0.102251, "fl_fpn7": 0.056136, "iter": 11000, "loss": 0.737872, "lr": 0.002000, "mb_qsize": 0, "mem": 6962, "retnet_bg_num": 59366331.000000, "retnet_fg_num": 327.125000, "retnet_loss_bbox_fpn3": 0.032386, "retnet_loss_bbox_fpn4": 0.035563, "retnet_loss_bbox_fpn5": 0.042428, "retnet_loss_bbox_fpn6": 0.041779, "retnet_loss_bbox_fpn7": 0.022579, "time": 0.494551}
json_stats: {"eta": "10:49:23", "fl_fpn3": 0.111536, "fl_fpn4": 0.131841, "fl_fpn5": 0.120292, "fl_fpn6": 0.097731, "fl_fpn7": 0.044005, "iter": 11020, "loss": 0.683587, "lr": 0.002000, "mb_qsize": 0, "mem": 6962, "retnet_bg_num": 59366757.000000, "retnet_fg_num": 342.500000, "retnet_loss_bbox_fpn3": 0.038253, "retnet_loss_bbox_fpn4": 0.049801, "retnet_loss_bbox_fpn5": 0.041532, "retnet_loss_bbox_fpn6": 0.036760, "retnet_loss_bbox_fpn7": 0.017812, "time": 0.493332}

And nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   30C    P0    34W / 250W |   7998MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   30C    P0    33W / 250W |   7816MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |   7796MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:09:00.0 Off |                    0 |
| N/A   30C    P0    33W / 250W |   7926MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0    35W / 250W |   7810MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   32C    P0    46W / 250W |   7926MiB / 16152MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   30C    P0    34W / 250W |   7824MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |   7826MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     36315      C   /bin/python2                                7988MiB |
|    1     36315      C   /bin/python2                                7806MiB |
|    2     36315      C   /bin/python2                                7786MiB |
|    3     36315      C   /bin/python2                                7916MiB |
|    4     36315      C   /bin/python2                                7800MiB |
|    5     36315      C   /bin/python2                                7916MiB |
|    6     36315      C   /bin/python2                                7814MiB |
|    7     36315      C   /bin/python2                                7816MiB |
+-----------------------------------------------------------------------------+
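
(By softmax focal loss I mean the standard focal-loss modulation applied to softmax class probabilities instead of per-class sigmoids, i.e. roughly FL = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the softmax probability of an anchor's ground-truth class.)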

System information

rbgirshick commented 6 years ago

@endernewton does this look like the problem you ran into that appears to be an NVIDIA driver bug?

endernewton commented 6 years ago

@rbgirshick Yes, it is exactly the same issue. Some NVIDIA engineers are working on it and have already reproduced it on their machines; hopefully it will be resolved soon.

@sciencefans For now, please stick to Pascal architectures as a backup plan.

slayton58 commented 6 years ago

@sciencefans For clarification, does the hang occur when NCCL is being used, or only when NCCL is disabled?

liuyuisanai commented 6 years ago

@slayton58 Both. If NCCL is disabled, Caffe2 hangs at context_gpu.cu:325 on nodes with either V100s or Titan Xps. On the V100 node, Caffe2 also hangs with NCCL v1 or v2 enabled, as described in this issue. So far I can only run training reliably with Pascal GPUs and NCCL enabled.
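
(For anyone trying to reproduce the two cases above: NCCL can be toggled through Detectron's command-line config overrides. The sketch below is from memory; the USE_NCCL key and the config path may need adjusting for your checkout.)

```bash
# Rough reproduction sketch; the flag name and config path are assumptions,
# adjust them to your Detectron checkout.
python2 tools/train_net.py \
    --cfg configs/12_2017_baselines/retinanet_R-50-FPN_1x.yaml \
    OUTPUT_DIR /tmp/detectron-output \
    USE_NCCL False    # True routes gradient sync through NCCL instead
```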

slayton58 commented 6 years ago

@sciencefans Thanks!

rbgirshick commented 6 years ago

Please update your NVIDIA driver to 396.26. We have confirmed that it avoids this deadlock issue when using V100s with CUDA >= 9.
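
A quick way to confirm which driver is actually loaded after the upgrade (it should report 396.26 or newer):

```bash
# Both commands report the driver version the kernel module is running.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
cat /proc/driver/nvidia/version
```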

jianchao-li commented 5 years ago

Hello, @sciencefans. Have you fixed the deadlock problem? I ran into a similar issue with multi-GPU training on PyTorch and MXNet. It happened on both GeForce GTX 1080 Ti and Tesla V100-PCIE-16GB cards with CUDA 9.0 and driver version 384.81.

ThomasDelteil commented 5 years ago

Same issue here, with CUDA 10.0 and 4x V100 running MXNet.

Shappenny commented 5 years ago

For anyone who landed on this page with a similar issue, or for whom @rbgirshick's solution didn't work: check out this solution on a similar issue. TL;DR: disable IOMMU by adding iommu=soft to the GRUB_CMDLINE_LINUX line in /etc/default/grub and rebooting. This solved an NCCL issue with the same symptoms for me after upgrading to driver v396.
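
On an Ubuntu/Debian-style system the change looks roughly like the sketch below (other distros regenerate the GRUB config differently, and you can edit /etc/default/grub by hand instead of using sed):

```bash
# Prepend iommu=soft to the kernel command line in /etc/default/grub.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="iommu=soft /' /etc/default/grub
sudo update-grub    # on RHEL/CentOS: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
# After rebooting, verify the kernel picked up the parameter:
grep -o 'iommu=soft' /proc/cmdline
```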

winnechan commented 5 years ago

So this issue cannot be solved just by rewriting my code or using DistributedDataParallel?