CUDA issue on the Compute Instance

PyTorch cannot detect the GPUs on a Compute Instance with NVIDIA A100 GPUs. nvidia-smi and nvcc --version both run and report the CUDA version and the CUDA toolkit version, but PyTorch returns the following:

# python -c "import torch; print(torch.cuda.is_available())"
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at  /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
False
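
To tell whether the error comes from PyTorch or from the driver itself, the CUDA driver API can be called directly, without PyTorch. The following is a minimal sketch, assuming libcuda.so.1 is on the loader path; probe_cuda_driver is a hypothetical helper name, not part of any library. A status of 802 here (CUDA_ERROR_SYSTEM_NOT_READY in the driver API) would suggest the problem sits below PyTorch.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount through the CUDA driver API.

    Returns (status, device_count). status is the CUresult code (0 means
    success), or None if the driver library could not be loaded.
    device_count is None whenever a call did not succeed.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        # Driver library missing or not loadable from this environment.
        return None, None

    status = libcuda.cuInit(0)
    if status != 0:
        return status, None

    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return status, (count.value if status == 0 else None)


if __name__ == "__main__":
    status, count = probe_cuda_driver()
    if status is None:
        print("libcuda.so.1 could not be loaded")
    elif status != 0:
        print(f"CUDA driver call failed with status {status}")
    else:
        print(f"CUDA driver initialized, {count} device(s) visible")
```

Running this on both the Compute Instance and a Compute Cluster node should show whether the driver itself reports a different status on the two machines.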

I tried pulling the PyTorch container from NVIDIA NGC, using the AML base images, and installing PyTorch with GPU support via pip/conda. All of them show the same CUDA issue. I also cannot compile the NCCL tests on the Compute Instance with GPUs.

However, everything works fine on the Compute Cluster VMs: NCCL compiles and runs fine, and PyTorch has no issue with CUDA. I suspect CUDA is installed but not configured correctly on the Compute Instance VMs.
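
To compare the two environments side by side, the same set of checks can be run on a Compute Instance and on a Compute Cluster node and the outputs diffed. This is a rough sketch only: run_check, collect_report and CHECKS are hypothetical names, and the nvidia-fabricmanager check is an assumption, not something confirmed for this setup (that service is commonly required on NVSwitch-based A100 systems, and error 802 can appear when it is not running).

```python
import shutil
import subprocess
import sys

# Commands to run on each machine. The fabricmanager entry is an assumption:
# it only applies if the VM uses NVSwitch-connected A100 GPUs and systemd.
CHECKS = {
    "nvidia-smi": ["nvidia-smi"],
    "nvcc": ["nvcc", "--version"],
    "torch": [
        sys.executable,
        "-c",
        "import torch; print(torch.version.cuda, torch.cuda.is_available())",
    ],
    "fabricmanager": ["systemctl", "is-active", "nvidia-fabricmanager"],
}


def run_check(cmd):
    """Run cmd (a list of strings) and return its combined stdout and stderr.

    Returns a short note instead of raising if the tool is missing or
    cannot be run.
    """
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]}: failed to run ({exc})"
    return (proc.stdout + proc.stderr).strip()


def collect_report():
    """Run every check and return a dict mapping check name to its output."""
    return {name: run_check(cmd) for name, cmd in CHECKS.items()}


if __name__ == "__main__":
    for name, output in collect_report().items():
        print(f"== {name} ==")
        print(output)
        print()
```

Saving the printed report from each machine to a file makes it straightforward to diff driver versions, toolkit versions, and service state between the working and failing VMs.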

Azure / MachineLearningNotebooks, issue #1839