HorizonRobotics / GUMP

Generative model for Unified Motion Planning tasks
Apache License 2.0

Torch environment error #1

Closed zachytong closed 2 months ago

zachytong commented 4 months ago

Hi,

Thanks for your great work! I loaded the provided Docker image and started the container as instructed. Inside the container, pip list shows several Python libraries already installed, including PyTorch. However, without installing anything else, running import torch; print(torch.cuda.is_available()) gives the following error:

/usr/local/lib/python3.9/dist-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
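A slightly more detailed version of the one-liner above can make the report easier to compare across machines. The following is only a minimal sketch: the function name collect_torch_cuda_info is hypothetical, and it relies on the initialization message being delivered as a Python warning, as the UserWarning above suggests.

```python
import warnings


def collect_torch_cuda_info():
    """Return a dict describing how PyTorch sees CUDA in this environment."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so nothing more can be checked.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
    }
    # Record any initialization warning instead of only printing it.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        info["cuda_available"] = torch.cuda.is_available()
    info["warnings"] = [str(w.message) for w in caught]
    return info


if __name__ == "__main__":
    for key, value in collect_torch_cuda_info().items():
        print(f"{key}: {value}")
```

Running it both inside the container and, if PyTorch is installed there, on the host shows whether the failure is specific to the container environment.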

pip list reports the following versions of the torch libraries:

torch                     2.3.1+cu121
torchaudio                2.3.1+cu121
torchmetrics              0.7.2
torchvision               0.18.1+cu121

My setup is an Ubuntu server with eight RTX 3090 GPUs. The NVIDIA-related information is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.1

The driver seems to work, since both nvidia-smi and nvcc run without problems. Is this error caused by a mistake in how I set up Docker, or is something missing? Thanks!
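As a first-pass check, the versions quoted above can be compared programmatically. This is a rough sketch using only the standard library; the helper names parse_nvidia_smi_header and cuda_version_from_torch_tag are hypothetical, and the parsing assumes the banner and wheel-tag formats shown in this report.

```python
import re


def parse_nvidia_smi_header(header):
    """Extract (driver_version, cuda_version) from the nvidia-smi banner line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", header)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", header)
    return (driver.group(1) if driver else None,
            cuda.group(1) if cuda else None)


def cuda_version_from_torch_tag(torch_version):
    """Turn a wheel tag such as '2.3.1+cu121' into '12.1' (None if no cu tag)."""
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if match is None:
        return None
    # The last digit is the minor version, the digits before it the major.
    return f"{match.group(1)}.{match.group(2)}"


driver, smi_cuda = parse_nvidia_smi_header(
    "NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.1")
torch_cuda = cuda_version_from_torch_tag("2.3.1+cu121")
print(driver, smi_cuda, torch_cuda)  # 525.89.02 12.1 12.1
```

Matching version numbers do not rule out Error 804 on their own: in this report the two values agree and CUDA initialization still fails.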

JingyuQian commented 4 months ago

https://github.com/pytorch/pytorch/issues/40671#issuecomment-904794431 It looks like the GPU driver on your machine somehow does not support the CUDA version inside the Docker image. I'm not sure why.

You can try building a different Docker image to see if it works. The Dockerfile contains commented-out lines that let you choose a different base image (line 15) and a different PyTorch version (line 119).