Closed: liuyuisanai closed this issue 6 years ago.
@endernewton does this look like the problem you ran into that seems to be an NVIDIA driver bug?
@rbgirshick Yes, it is exactly the same issue. Some NVIDIA people are working on it and they have already reproduced it on their machines; hopefully it will be resolved soon.
@sciencefans For now, please stick to Pascal architectures as a backup plan.
@sciencefans For clarification, is your hang when NCCL is being used, or do you only hang when NCCL is disabled?
@slayton58 Both. If NCCL is disabled, Caffe2 hangs at context_gpu.cu:325 on nodes with V100 and Titan Xp. On the node with V100, Caffe2 will also hang if NCCL v1 or v2 is enabled, as described in this issue. So far I have only been able to run it well with a Pascal GPU + NCCL enabled.
@sciencefans Thanks!
Please update your NVIDIA driver to 396.26. We have confirmed that it avoids this deadlock issue when using V100s with CUDA >= 9.
Hello, @sciencefans. Have you fixed the deadlock problem? I also ran into a similar problem when using multi-GPU training on PyTorch and MXNet. It happened on both a GeForce GTX 1080 Ti and a Tesla V100-PCIE-16GB on CUDA 9.0 with driver version 384.81.
Same issue here, with CUDA 10.0 and 4x V100 running MXNet.
For anyone facing a similar issue who landed on this page, or if @rbgirshick's solution didn't work for you, check out this solution on a similar issue.
TL;DR: disable the IOMMU by changing/adding the line
GRUB_CMDLINE_LINUX="iommu=soft"
in /etc/default/grub and rebooting. This solved an issue with NCCL that presented the same symptoms for me after upgrading to driver v396.
So this issue cannot be solved by just rewriting my code or using DistributedDataParallel?
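For reference, since DistributedDataParallel is mentioned above: the snippet below is a minimal, hypothetical PyTorch sketch of what "using DistributedDataParallel" typically looks like (the model, data, and torchrun launch are placeholder assumptions, not taken from this issue). Note that a code-level change like this would not by itself work around the driver-level deadlock discussed here; the fix reported above is updating to driver 396.26.

```python
# Minimal, hypothetical single-node DistributedDataParallel sketch.
# Assumes a recent PyTorch; launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn


def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    dist.init_process_group(backend="nccl")      # NCCL backend for multi-GPU
    torch.cuda.set_device(local_rank)

    # Placeholder model and data, just to show the wrapping pattern.
    model = nn.Linear(128, 10).cuda(local_rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    inputs = torch.randn(32, 128).cuda(local_rank)
    targets = torch.randint(0, 10, (32,)).cuda(local_rank)

    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```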
Expected results
Training process should run well.
Actual results
The multi-GPU training process gets stuck randomly somewhere, and one or more GPUs run at 100% utilization with low power usage.
Detailed steps to reproduce
I trained RetinaNet with the R-50_1x configuration and modified the sigmoid focal loss to a softmax focal loss. After several iterations, the process gets stuck and shows: And nvidia-smi shows:
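The modified loss itself is not included in this report, so purely as an illustration, here is a minimal softmax focal loss sketch in PyTorch. The function name, the scalar alpha weighting, and the random example data are assumptions of mine; the actual change in this issue was made to Detectron's sigmoid focal loss.

```python
# Rough, hypothetical softmax focal loss sketch (not the code from this report).
import torch
import torch.nn.functional as F


def softmax_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits: (N, C) class scores; targets: (N,) integer class labels."""
    log_probs = F.log_softmax(logits, dim=1)                       # (N, C)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t of the true class
    pt = log_pt.exp()
    # FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()


# Example with random data: 81 classes (e.g. 80 foreground + background).
logits = torch.randn(8, 81)
targets = torch.randint(0, 81, (8,))
print(softmax_focal_loss(logits, targets).item())
```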
System information
python --version
output: 2.7.5