Torch environment error

Hi,

Thanks for your great work! I loaded the provided Docker image and started the container as instructed. Inside the container, pip list shows several preinstalled Python libraries, including PyTorch. However, without installing anything else, running import torch; print(torch.cuda.is_available()) directly gives the following error:

/usr/local/lib/python3.9/dist-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
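
For easier reproduction, here is a minimal sketch of the same check that also captures the warning text, so it can be pasted into a report. The function name check_cuda is only illustrative and not part of the repository; it assumes nothing beyond torch.cuda.is_available() and the standard warnings module, and it reports a missing torch install instead of crashing.

```python
import warnings


def check_cuda():
    """Return (available, messages).

    available is the result of torch.cuda.is_available(), or None if
    torch cannot be imported. messages holds the text of any warnings
    raised during the check (for example the CUDA initialization warning).
    """
    try:
        import torch
    except ImportError:
        return None, ["torch is not installed in this environment"]

    # Record every warning instead of letting it go to stderr.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        available = torch.cuda.is_available()

    return available, [str(w.message) for w in caught]


if __name__ == "__main__":
    available, messages = check_cuda()
    print("CUDA available:", available)
    for message in messages:
        print("warning:", message)
```

Running this inside the container should print the same Error 804 text as above if the problem is present.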

The pip list command reports the following versions of the torch-related libraries:

torch                     2.3.1+cu121
torchaudio                2.3.1+cu121
torchmetrics              0.7.2
torchvision               0.18.1+cu121

My setup is an Ubuntu server with eight 3090 GPUs. The NVIDIA-related information is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.1
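
To compare the versions above mechanically, here is a rough sketch that extracts the CUDA version from the torch wheel tag and from the nvidia-smi header line. Both helper names are hypothetical, and the parsing assumes the usual +cuXYZ suffix convention of PyTorch wheels (last digit is the minor version) and the "CUDA Version: X.Y" text in the nvidia-smi header. Matching numbers only show that the reported versions agree; they do not by themselves prove the driver is compatible.

```python
import re


def cuda_from_torch_version(version):
    """Parse a wheel version such as '2.3.1+cu121' into (12, 1).

    Returns None when the string has no +cu suffix.
    """
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumed convention: last digit is the minor version, the rest is the major.
    return int(digits[:-1]), int(digits[-1])


def cuda_from_smi_header(header):
    """Parse 'CUDA Version: X.Y' out of an nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    torch_cuda = cuda_from_torch_version("2.3.1+cu121")
    smi_cuda = cuda_from_smi_header(
        "NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.1"
    )
    print("torch built for CUDA:", torch_cuda)
    print("nvidia-smi reports CUDA:", smi_cuda)
    print("match:", torch_cuda == smi_cuda)
```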

The driver seems to work, since both the nvidia-smi and nvcc commands run without problems. Is this error caused by a mistake in how I set up Docker, or is something missing? Thanks!
