NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Cannot access CUDA GPU on WSL #5462

Open benchd opened 3 months ago

benchd commented 3 months ago

Version

nvidia-dali-cuda120:1.37.1, nvidia-dali-nightly-cuda120 1.38.0.dev20240507

Describe the bug.

I've been following https://github.com/NVIDIA/DALI/issues/4663 and I'm seeing something similar, but I cannot figure out why. I can access my GPU on device 0 using nvidia-smi, and I can access it from the same conda environment with PyTorch, so it is unclear to me why DALI cannot. This is inside a conda environment inside WSL on Windows.

Minimum reproducible example

Conda environment:
name: multilabelimage_model_env
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch
  - torchvision
  - torchaudio
  - pytorch-cuda=12.1
  - opencv
  - pandas
  - scikit-learn=1.4.0
  - wandb
  - matplotlib
  - tqdm
  - pillow
  - numpy
  - scipy
  - pyyaml
  - pip
  - pip:
      - torch-summary
      - tensorboard
      - torch-tb-profiler
      - torch-geometric
      - timm

Installed DALI using the official installation guide:
pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda120

Also tried with the nightly build.

Tested with a minimal example:
```python
import nvidia.dali as dali
import numpy as np

@dali.pipeline_def
def my_pipe():
  return dali.fn.external_source(np.array([1,2,3], dtype=np.float32), batch=False).gpu()

pipe = my_pipe(batch_size=1, num_threads=1, device_id=1)
pipe.build()
print(pipe.run())
```

Relevant log output

The minimal example above fails with this error:

python dali_test.py
/root/miniconda3/envs/multilabelimage_model_env/lib/python3.11/site-packages/nvidia/dali/backend.py:99: Warning: nvidia-dali-cuda120 is no longer shipped with CUDA runtime. You need to install it separately. cuFFT is typically provided with CUDA Toolkit installation or an appropriate wheel. Please check https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#pip-wheels-installation-linux for the reference.
  deprecation_warning(
/root/miniconda3/envs/multilabelimage_model_env/lib/python3.11/site-packages/nvidia/dali/backend.py:110: Warning: nvidia-dali-cuda120 is no longer shipped with CUDA runtime. You need to install it separately. NPP is typically provided with CUDA Toolkit installation or an appropriate wheel. Please check https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#pip-wheels-installation-linux for the reference.
  deprecation_warning(
/root/miniconda3/envs/multilabelimage_model_env/lib/python3.11/site-packages/nvidia/dali/backend.py:121: Warning: nvidia-dali-cuda120 is no longer shipped with CUDA runtime. You need to install it separately. nvJPEG is typically provided with CUDA Toolkit installation or an appropriate wheel. Please check https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#pip-wheels-installation-linux for the reference.
  deprecation_warning(
Traceback (most recent call last):
  File "/mnt/c/Coding/Testing/PyTorch/MultiLabelClassification_Patreon/actual_real_user_code/dali_test.py", line 8, in <module>
    pipe.build()
  File "/root/miniconda3/envs/multilabelimage_model_env/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 979, in build
    self._init_pipeline_backend()
  File "/root/miniconda3/envs/multilabelimage_model_env/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 813, in _init_pipeline_backend
    self._pipe = b.Pipeline(
                 ^^^^^^^^^^^
RuntimeError: CUDA runtime API error cudaErrorInvalidDevice (101):
invalid device ordinal

Other/Misc.

I found similar issues but could not find a solution.

mzient commented 3 months ago

Hello @benchd, please check your device ID. You said you can access "device 0", but your DALI snippet specifies device 1:

pipe = my_pipe(batch_size=1, num_threads=1, device_id=1)
                                            ^^^^^^^^^^^
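
One way to catch this kind of mismatch early is to validate the ordinal before building the pipeline. The following is a minimal sketch, not part of DALI: the helper names `check_device_id` and `run_minimal_pipeline` are hypothetical, and it assumes PyTorch is installed in the same environment (as in the report) so that `torch.cuda.device_count()` can report how many CUDA devices are visible. The pipeline body is the snippet from the report, unchanged apart from the device ID being passed in.

```python
def check_device_id(device_id: int, device_count: int) -> int:
    """Return device_id if it is a valid CUDA ordinal, otherwise raise ValueError.

    Hypothetical helper: CUDA ordinals are zero-based, so with a single GPU
    the only valid value is 0.
    """
    if not 0 <= device_id < device_count:
        raise ValueError(
            f"device_id={device_id} is out of range: {device_count} CUDA "
            f"device(s) visible, valid ordinals are {list(range(device_count))}"
        )
    return device_id


def run_minimal_pipeline(device_id: int = 0):
    """Build and run the minimal pipeline from the report on a checked device.

    Imports are local so that the helper above can be used without a GPU.
    Assumes PyTorch and DALI are both installed in the active environment.
    """
    import numpy as np
    import nvidia.dali as dali
    import torch

    # Fail with a readable message instead of cudaErrorInvalidDevice.
    check_device_id(device_id, torch.cuda.device_count())

    @dali.pipeline_def
    def my_pipe():
        return dali.fn.external_source(
            np.array([1, 2, 3], dtype=np.float32), batch=False
        ).gpu()

    pipe = my_pipe(batch_size=1, num_threads=1, device_id=device_id)
    pipe.build()
    return pipe.run()
```

On a machine with one visible GPU, `run_minimal_pipeline(0)` should build normally, while `run_minimal_pipeline(1)` would raise a `ValueError` from the check rather than the "invalid device ordinal" runtime error shown in the log.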