PyTorch cannot detect GPUs on a Compute Instance with NVIDIA A100 GPUs. Both nvidia-smi and nvcc --version report the installed driver CUDA version and CUDA toolkit version. But PyTorch returns the following:
# python -c "import torch; print(torch.cuda.is_available())"
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False
I tried pulling the PyTorch container from NVIDIA NGC, using the AML base images, and installing PyTorch with GPU support via pip/conda. All hit the same CUDA issue. I also cannot compile the NCCL tests on the Compute Instance with GPUs.
However, everything works fine on the Compute Cluster VMs: NCCL compiles/runs fine and PyTorch has no issue with CUDA. I suspect CUDA is installed but not configured correctly on the Compute Instance VMs.
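For reference, here is the diagnostic sketch I run on both VM types to compare them side by side. It only wraps the commands already mentioned above, plus a check of the nvidia-fabricmanager service, which is an assumption on my part: "Error 802: system not yet initialized" on A100-class hardware is often associated with the NVIDIA Fabric Manager not running, but I have not confirmed that this is the cause here.

```python
# Hypothetical diagnostic helper: run each check if the command exists
# and print its first line of output, so the two VM types can be diffed.
import shutil
import subprocess


def check(cmd):
    """Run a diagnostic command if present; return its first output line."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout or result.stderr).strip().splitlines()
    return lines[0] if lines else f"{cmd[0]}: no output"


if __name__ == "__main__":
    for cmd in (
        ["nvidia-smi"],
        ["nvcc", "--version"],
        # Assumption: Fabric Manager state may explain Error 802 on A100s.
        ["systemctl", "is-active", "nvidia-fabricmanager"],
    ):
        print(" ".join(cmd), "->", check(cmd))
```

On the Compute Cluster VMs every check returns normally; on the Compute Instance I would like to know which of these differs.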