NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Error: Unable to connect to NVIDIA DIGITS #1294

Open OxInsky opened 7 years ago

OxInsky commented 7 years ago
 To fix this problem, try stopping and restarting the DIGITS server by running the following two commands:

    sudo stop nvidia-digits-server
    sudo start nvidia-digits-server

If you are still encountering problems after attempting to restart the DIGITS server, check /var/log/digits/digits.log for errors. 

I restarted nvidia-digits-server, but it did not work. I then looked at digits.log with cat; its contents follow:

 2016-11-24 10:30:20 [19222] [INFO] Booting worker with pid: 19222
WARNING: Logging before InitGoogleLogging() is written to STDERR
E1124 11:04:38.543714 19222 common.cpp:110] Cannot create Cublas handle. Cublas won't be available.
E1124 11:04:38.600045 19222 common.cpp:117] Cannot create Curand generator. Curand won't be available.
E1124 11:04:38.623075 19222 common.cpp:121] Cannot create cuDNN handle. cuDNN won't be available.
F1124 11:04:38.625888 19222 syncedmem.hpp:19] Check failed: error == cudaSuccess (3 vs. 0)  initialization error
*** Check failure stack trace: ***
2016-11-24 11:04:39 [32356] [INFO] Booting worker with pid: 32356
2016-11-24 11:05:53 [22648] [INFO] Handling signal: term
2016-11-24 11:06:13 [553] [INFO] Starting gunicorn 17.5
2016-11-24 11:06:14 [553] [DEBUG] Arbiter booted
2016-11-24 11:06:14 [553] [INFO] Listening at: http://0.0.0.0:34448 (553)
2016-11-24 11:06:14 [553] [INFO] Using worker: socketio.sgunicorn.GeventSocketIOWorker
2016-11-24 11:06:14 [666] [INFO] Booting worker with pid: 666
WARNING: Logging before InitGoogleLogging() is written to STDERR
E1124 11:06:59.829854   666 common.cpp:110] Cannot create Cublas handle. Cublas won't be available.
E1124 11:06:59.847425   666 common.cpp:117] Cannot create Curand generator. Curand won't be available.
E1124 11:06:59.864856   666 common.cpp:121] Cannot create cuDNN handle. cuDNN won't be available.
F1124 11:06:59.869220   666 syncedmem.hpp:19] Check failed: error == cudaSuccess (3 vs. 0)  initialization error
*** Check failure stack trace: ***
2016-11-24 11:07:01 [1010] [INFO] Booting worker with pid: 1010
WARNING: Logging before InitGoogleLogging() is written to STDERR
E1124 11:07:33.873805  1010 common.cpp:110] Cannot create Cublas handle. Cublas won't be available.
E1124 11:07:33.891679  1010 common.cpp:117] Cannot create Curand generator. Curand won't be available.
E1124 11:07:33.910784  1010 common.cpp:121] Cannot create cuDNN handle. cuDNN won't be available.
F1124 11:07:33.915125  1010 syncedmem.hpp:19] Check failed: error == cudaSuccess (3 vs. 0)  initialization error
*** Check failure stack trace: ***
2016-11-24 11:07:35 [1259] [INFO] Booting worker with pid: 1259

I am a newbie! Can you help me solve this problem? Thanks.
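
For readers skimming the log above: the lines that start with `E` or `F` are glog-style records (severity letter, month and day, time, process/thread id, source file and line), while the timestamped `[INFO]` lines come from gunicorn. The following is a minimal Python sketch, not part of DIGITS, that pulls only the error and fatal records out of such a log. The name `glog_errors` is hypothetical, and the regular expression assumes the exact layout seen in this excerpt.

```python
import re

# Layout assumed from the excerpt above:
# <severity><MMDD> <HH:MM:SS.micros> <id> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<source>[^\s:]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def glog_errors(lines):
    """Yield (severity, pid, source, message) for ERROR and FATAL records."""
    for raw in lines:
        match = GLOG_LINE.match(raw.strip())
        if match and match.group("severity") in ("E", "F"):
            yield (
                match.group("severity"),
                int(match.group("pid")),
                match.group("source"),
                match.group("message"),
            )


if __name__ == "__main__":
    # Path taken from the DIGITS error page quoted earlier in this comment.
    with open("/var/log/digits/digits.log") as log_file:
        for record in glog_errors(log_file):
            print(record)
```

Applied to this log, every worker ends on the same fatal record from `syncedmem.hpp`. The `3` in `(3 vs. 0)` is the CUDA runtime error code whose message is "initialization error" (`cudaErrorInitializationError`), so the worker fails while initializing CUDA rather than inside DIGITS itself.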

lukeyeager commented 7 years ago

You've got some CUDA toolkit and/or driver issues. Can you make a standard CUDA sample?

What do these commands tell you?

OxInsky commented 7 years ago

nvidia-smi

Sun Dec 11 10:54:07 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0      On |                  N/A |
| 22%   32C    P8    16W / 250W |    289MiB / 12204MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      3MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 22%   32C    P8    14W / 250W |      3MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      3MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1235    C   /usr/bin/python                                106MiB |
|    0      1682    G   /usr/bin/X                                      94MiB |
|    0      3019    G   compiz                                          85MiB |
+-----------------------------------------------------------------------------+

./digits/device_query.py

Device #0: GeForce GTX TITAN X
                totalGlobalMem 12797476864
             sharedMemPerBlock 49152
                  regsPerBlock 65536
                      warpSize 32
                      memPitch 2147483647
            maxThreadsPerBlock 1024
                     clockRate 1076000
                 totalConstMem 65536
                         major 5
                         minor 2
              textureAlignment 512
         texturePitchAlignment 32
                 deviceOverlap 1
           multiProcessorCount 24
      kernelExecTimeoutEnabled 1
                    integrated 0
              canMapHostMemory 1
                   computeMode 0
                  maxTexture1D 65536
            maxTexture1DMipmap 16384
            maxTexture1DLinear 134217728
             maxTextureCubemap 16384
                  maxSurface1D 16384
             maxSurfaceCubemap 16384
              surfaceAlignment 512
             concurrentKernels 1
                    ECCEnabled 0
                      pciBusID 2
                   pciDeviceID 0
                   pciDomainID 0
                     tccDriver 0
              asyncEngineCount 2
             unifiedAddressing 1
               memoryClockRate 3505000
                memoryBusWidth 384
                   l2CacheSize 3145728
   maxThreadsPerMultiProcessor 2048
     streamPrioritiesSupported 1
        globalL1CacheSupported 1
         localL1CacheSupported 1
    sharedMemPerMultiprocessor 98304
         regsPerMultiprocessor 65536
           managedMemSupported 1
               isMultiGpuBoard 0
          multiGpuBoardGroupID 0
           Total memory (NVML) 12204 MB
            Used memory (NVML) 396 MB
            Free memory (NVML) 11808 MB
        GPU utilization (NVML) 0%
     Memory utilization (NVML) 0%
            Temperature (NVML) 32 C

Device #1: GeForce GTX TITAN X
                totalGlobalMem 12799574016
             sharedMemPerBlock 49152
                  regsPerBlock 65536
                      warpSize 32
                      memPitch 2147483647
            maxThreadsPerBlock 1024
                     clockRate 1076000
                 totalConstMem 65536
                         major 5
                         minor 2
              textureAlignment 512
         texturePitchAlignment 32
                 deviceOverlap 1
           multiProcessorCount 24
      kernelExecTimeoutEnabled 0
                    integrated 0
              canMapHostMemory 1
                   computeMode 0
                  maxTexture1D 65536
            maxTexture1DMipmap 16384
            maxTexture1DLinear 134217728
             maxTextureCubemap 16384
                  maxSurface1D 16384
             maxSurfaceCubemap 16384
              surfaceAlignment 512
             concurrentKernels 1
                    ECCEnabled 0
                      pciBusID 3
                   pciDeviceID 0
                   pciDomainID 0
                     tccDriver 0
              asyncEngineCount 2
             unifiedAddressing 1
               memoryClockRate 3505000
                memoryBusWidth 384
                   l2CacheSize 3145728
   maxThreadsPerMultiProcessor 2048
     streamPrioritiesSupported 1
        globalL1CacheSupported 1
         localL1CacheSupported 1
    sharedMemPerMultiprocessor 98304
         regsPerMultiprocessor 65536
           managedMemSupported 1
               isMultiGpuBoard 0
          multiGpuBoardGroupID 1
           Total memory (NVML) 12206 MB
            Used memory (NVML) 3 MB
            Free memory (NVML) 12203 MB
        GPU utilization (NVML) 0%
     Memory utilization (NVML) 0%
            Temperature (NVML) 33 C

Device #2: GeForce GTX TITAN X
                totalGlobalMem 12799574016
             sharedMemPerBlock 49152
                  regsPerBlock 65536
                      warpSize 32
                      memPitch 2147483647
            maxThreadsPerBlock 1024
                     clockRate 1076000
                 totalConstMem 65536
                         major 5
                         minor 2
              textureAlignment 512
         texturePitchAlignment 32
                 deviceOverlap 1
           multiProcessorCount 24
      kernelExecTimeoutEnabled 0
                    integrated 0
              canMapHostMemory 1
                   computeMode 0
                  maxTexture1D 65536
            maxTexture1DMipmap 16384
            maxTexture1DLinear 134217728
             maxTextureCubemap 16384
                  maxSurface1D 16384
             maxSurfaceCubemap 16384
              surfaceAlignment 512
             concurrentKernels 1
                    ECCEnabled 0
                      pciBusID 130
                   pciDeviceID 0
                   pciDomainID 0
                     tccDriver 0
              asyncEngineCount 2
             unifiedAddressing 1
               memoryClockRate 3505000
                memoryBusWidth 384
                   l2CacheSize 3145728
   maxThreadsPerMultiProcessor 2048
     streamPrioritiesSupported 1
        globalL1CacheSupported 1
         localL1CacheSupported 1
    sharedMemPerMultiprocessor 98304
         regsPerMultiprocessor 65536
           managedMemSupported 1
               isMultiGpuBoard 0
          multiGpuBoardGroupID 2
           Total memory (NVML) 12206 MB
            Used memory (NVML) 3 MB
            Free memory (NVML) 12203 MB
        GPU utilization (NVML) 0%
     Memory utilization (NVML) 0%
            Temperature (NVML) 33 C

Device #3: GeForce GTX TITAN X
                totalGlobalMem 12799574016
             sharedMemPerBlock 49152
                  regsPerBlock 65536
                      warpSize 32
                      memPitch 2147483647
            maxThreadsPerBlock 1024
                     clockRate 1076000
                 totalConstMem 65536
                         major 5
                         minor 2
              textureAlignment 512
         texturePitchAlignment 32
                 deviceOverlap 1
           multiProcessorCount 24
      kernelExecTimeoutEnabled 0
                    integrated 0
              canMapHostMemory 1
                   computeMode 0
                  maxTexture1D 65536
            maxTexture1DMipmap 16384
            maxTexture1DLinear 134217728
             maxTextureCubemap 16384
                  maxSurface1D 16384
             maxSurfaceCubemap 16384
              surfaceAlignment 512
             concurrentKernels 1
                    ECCEnabled 0
                      pciBusID 131
                   pciDeviceID 0
                   pciDomainID 0
                     tccDriver 0
              asyncEngineCount 2
             unifiedAddressing 1
               memoryClockRate 3505000
                memoryBusWidth 384
                   l2CacheSize 3145728
   maxThreadsPerMultiProcessor 2048
     streamPrioritiesSupported 1
        globalL1CacheSupported 1
         localL1CacheSupported 1
    sharedMemPerMultiprocessor 98304
         regsPerMultiprocessor 65536
           managedMemSupported 1
               isMultiGpuBoard 0
          multiGpuBoardGroupID 3
           Total memory (NVML) 12206 MB
            Used memory (NVML) 3 MB
            Free memory (NVML) 12203 MB
        GPU utilization (NVML) 0%
     Memory utilization (NVML) 0%
            Temperature (NVML) 33 C

dpkg -l | egrep 'nvidia|cudart|libcudnn|libnccl|caffe|torch|digits'

ii  caffe-nv                                              0.14.5-2+cuda7.5                                    amd64        Fast open framework for Deep Learning
ii  caffe-nv-tools                                        0.14.5-2+cuda7.5                                    amd64        Fast open framework for Deep Learning (Tools)
ii  cuda-cudart-7-5                                       7.5-18                                              amd64        CUDA Runtime native Libraries
ii  digits                                                3.0.0-1                                             amd64        NVIDIA DIGITS webserver
ii  libcaffe-nv0                                          0.14.5-2+cuda7.5                                    amd64        Fast open framework for Deep Learning (Libs)
ii  libcudnn4                                             4.0.7                                               amd64        cuDNN runtime libraries
ii  libcudnn4-dev                                         4.0.7                                               amd64        cuDNN development libraries and headers
ii  libcudnn5                                             5.0.6-1+cuda7.5                                     amd64        cuDNN runtime libraries
ii  libnccl1                                              1.2.1-1+cuda7.5                                     amd64        NVIDIA Communication Collectives Library (NCCL) Runtime
ii  nvidia-machine-learning-repo                          4.0-2                                               amd64        NVIDIA Deep Learning Packages
ii  python-caffe-nv                                       0.14.5-2+cuda7.5                                    amd64        Fast open framework for Deep Learning (Python)
ii  torch7-nv                                             0.9.98-1+cuda7.5                                    amd64        NVidia Torch Bundle (with CUDA). Made for DIGITS.

The CUDA toolkit is probably fine, because we regularly use these GPUs to accelerate network training and that works well. I'm sorry for the late reply: GitHub did not notify me about your answer, although I expected it would. I look forward to your reply, thanks!
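
The `dpkg -l` listing above can be checked mechanically. In it, every package carrying a `+cuda` suffix was built against CUDA 7.5, and both `libcudnn4` and `libcudnn5` are installed side by side. The sketch below, in Python with the hypothetical helper names `parse_dpkg_listing` and `cuda_suffixes`, extracts that information from the text output. It assumes the column layout shown above (`ii <name> <version> <arch> <description>`) and does not by itself say whether any of these packages causes the failure.

```python
import re


def parse_dpkg_listing(lines):
    """Return {package name: version} for rows marked 'ii' (installed)."""
    packages = {}
    for raw in lines:
        fields = raw.split()
        # Skip headers, blank lines, and packages in any state other than 'ii'.
        if len(fields) < 3 or fields[0] != "ii":
            continue
        packages[fields[1]] = fields[2]
    return packages


def cuda_suffixes(packages):
    """Return the set of CUDA versions named in '+cudaX.Y' version suffixes."""
    found = set()
    for version in packages.values():
        match = re.search(r"\+cuda(\d+(?:\.\d+)*)$", version)
        if match:
            found.add(match.group(1))
    return found


def cudnn_runtimes(packages):
    """Return installed cuDNN packages, sorted by name."""
    return sorted(name for name in packages if name.startswith("libcudnn"))
```

More than one entry in the result of `cuda_suffixes` would point to packages built against different CUDA toolkits, which is worth ruling out before looking at the driver.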

lukeyeager commented 7 years ago

You've got a CUDA 8.0 RC driver - you might want to try updating to a proper release driver? https://github.com/NVIDIA/nvidia-docker/wiki/CUDA#requirements

Also, why not upgrade to DIGITS 4 at least? DIGITS 3 is pretty old. https://github.com/NVIDIA/DIGITS/releases/tag/v3.0.0
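
The driver version referred to here is the one printed in the `nvidia-smi` banner earlier in the thread (`Driver Version: 367.27`). As a rough Python sketch, the version can be read from that banner and compared against a minimum. The helper names `driver_version` and `is_at_least` are hypothetical, and the minimum used in the example is a placeholder, not a documented requirement; the real value should come from the requirements table linked above.

```python
import re


def driver_version(nvidia_smi_output):
    """Return the driver version from the nvidia-smi banner as a tuple of ints, or None."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)+)", nvidia_smi_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(version, minimum):
    """Compare two version tuples; an unknown (None) version never passes."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    banner = "| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |"
    installed = driver_version(banner)
    # Placeholder minimum, for illustration only.
    placeholder_minimum = (375, 0)
    print(installed, is_at_least(installed, placeholder_minimum))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "367.4" would sort after "367.27".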

OxInsky commented 7 years ago

Will updating to DIGITS 4 solve my problem? Will it affect training networks on the GPUs? And how do I do it? Thanks a lot! PS: I cannot update the driver, because many people on my team use this machine.

lukeyeager commented 7 years ago

You already have access to the 3.0 debs, so I expect you also have access to the 4.0 debs.

sudo apt-get update
sudo apt-get upgrade