NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

waiting forever with multi-GPU #1495

Open robotsoft opened 7 years ago

robotsoft commented 7 years ago

Hello,

I've been following the "Using DIGITS to train an Object Detection network" tutorial. When I click the Create button to train DetectNet in DIGITS with multiple GPUs (2, 3, or 4), the job status stays at "waiting" forever. When I select only a single GPU, the job status changes to "running", but the estimated time is about 20 hours. The dataset is the KITTI object dataset, as mentioned in the tutorial. I installed nv-deep-learning-repo-ubuntu1404-ga-cuda8.0-cudnn5.1.10_1-1_amd64.deb. The DIGITS version is 4.0.0 and the Caffe version is 0.5.13. Any comments would be appreciated.

jogiji commented 4 years ago

You probably need to check whether IOMMU is enabled or disabled in your system BIOS. I faced a similar issue, and my multi-GPU training started working once I disabled IOMMU.
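
If it helps anyone checking this from a running Linux system before rebooting into the BIOS, here is a rough Python sketch. It assumes the kernel exposes its boot flags in /proc/cmdline and lists active IOMMU groups under /sys/kernel/iommu_groups; the helper names `iommu_flags` and `iommu_groups_present` are made up for illustration and are not part of DIGITS or Caffe. It only reports what the OS sees, so the BIOS setting itself is still the thing to change.

```python
import os


def iommu_flags(cmdline):
    """Return the kernel command-line tokens that mention IOMMU
    (for example intel_iommu=on, amd_iommu=on, iommu=pt)."""
    return [tok for tok in cmdline.split() if "iommu" in tok.lower()]


def iommu_groups_present(path="/sys/kernel/iommu_groups"):
    """Return True if at least one IOMMU group is listed under `path`,
    which suggests the IOMMU is active. Assumes the usual sysfs layout."""
    try:
        return len(os.listdir(path)) > 0
    except OSError:
        # Directory missing or unreadable: treat as "no groups found".
        return False


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""
    print("IOMMU-related boot flags:", iommu_flags(cmdline) or "none")
    print("IOMMU groups present:", iommu_groups_present())
```

If the second line prints True, the IOMMU is most likely still enabled, and disabling it in the BIOS as described above would be the next thing to try.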