Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

CUDA issue on the Compute Instance #1839

Open JingchaoZhang opened 1 year ago

JingchaoZhang commented 1 year ago

PyTorch cannot detect the GPUs on a Compute Instance with NVIDIA A100 GPUs. Both nvidia-smi and nvcc --version run and report the installed CUDA version and CUDA toolkit version, but PyTorch returns the following:

# python -c "import torch; print(torch.cuda.is_available())"
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at  /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
False

I tried pulling the PyTorch container from NVIDIA NGC, using the AML base images, and installing PyTorch with GPU support via pip/conda. All of them hit the same CUDA issue. I also cannot compile the NCCL tests on the Compute Instance with GPUs.

However, everything works fine on the Compute Cluster VMs: the NCCL tests compile and run fine, and PyTorch has no issue with CUDA. I suspect CUDA is installed but not configured correctly on the Compute Instance VMs.
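
To make the comparison between the two VM types easier to reproduce, here is a minimal sketch that gathers the checks mentioned above (nvidia-smi, nvcc --version, and the PyTorch CUDA status) into one report. The helper names run_command, torch_report, and collect_report are hypothetical and not part of any Azure or PyTorch API. It assumes only that the script is run with the same Python environment as the failing command, and it tolerates missing tools or a missing torch install.

```python
import shutil
import subprocess


def run_command(cmd):
    """Run a command (given as a list) and return its combined output as text.

    Returns a short note instead of raising if the tool is missing or hangs.
    """
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found on PATH"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        return f"{cmd[0]}: timed out"
    return (result.stdout + result.stderr).strip()


def torch_report():
    """Summarize what PyTorch itself sees, without assuming it is installed."""
    try:
        import torch
    except ImportError:
        return {"torch": "not installed"}
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # The call that returns False on the Compute Instance in this issue.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


def collect_report():
    """Collect driver, toolkit, and PyTorch information in one dictionary."""
    return {
        "nvidia-smi": run_command(["nvidia-smi"]),
        "nvcc": run_command(["nvcc", "--version"]),
        "pytorch": torch_report(),
    }


if __name__ == "__main__":
    for name, value in collect_report().items():
        print(f"== {name} ==")
        print(value)
```

Running this once on the Compute Instance and once on a Compute Cluster node gives two reports that can be compared side by side, for example to see whether the driver version or the CUDA version PyTorch was built against differs between them.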