NVlabs / DiffRL

[ICLR 2022] Accelerated Policy Learning with Parallel Differentiable Simulation
https://short-horizon-actor-critic.github.io/

Error when python test_env.py --env AntEnv #12

Open HzfFrank opened 1 year ago

HzfFrank commented 1 year ago

Excuse me, I ran into the following problem when running python test_env.py --env AntEnv in the examples folder, as described in the guide. My PyTorch version is 1.11.0 and my CUDA version is 12.1. Is there anything wrong with my setup? I'd appreciate any help with this problem.

Rebuilding kernels
Detected CUDA files, patching ldflags
Emitting ninja build file /home/frank/DiffRL/dflex/dflex/kernels/build.ninja...
Building extension module kernels...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda-12.1/bin/nvcc  -DTORCH_EXTENSION_NAME=kernels -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/frank/DiffRL/dflex/dflex -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/TH -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /home/frank/anaconda3/envs/shac/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -gencode=arch=compute_35,code=compute_35 -std=c++14 -c /home/frank/DiffRL/dflex/dflex/kernels/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/local/cuda-12.1/bin/nvcc  -DTORCH_EXTENSION_NAME=kernels -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/frank/DiffRL/dflex/dflex -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/TH -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /home/frank/anaconda3/envs/shac/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -gencode=arch=compute_35,code=compute_35 -std=c++14 -c /home/frank/DiffRL/dflex/dflex/kernels/cuda.cu -o cuda.cuda.o
nvcc fatal   : Unsupported gpu architecture 'compute_35'
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=kernels -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/frank/DiffRL/dflex/dflex -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/TH -isystem /home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /home/frank/anaconda3/envs/shac/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -Z -O2 -DNDEBUG -c /home/frank/DiffRL/dflex/dflex/kernels/main.cpp -o main.o
/home/frank/DiffRL/dflex/dflex/kernels/main.cpp: In function ‘df::float3 box_sdf_grad_cpu_func(df::float3, df::float3)’:
/home/frank/DiffRL/dflex/dflex/kernels/main.cpp:1051:47: warning: control reaches end of non-void function [-Wreturn-type]
 1051 |     var_58 = df::select(var_56, var_53, var_57);
      |                                               ^
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/home/frank/anaconda3/envs/shac/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_env.py", line 17, in <module>                                                                                                                                                       
    import envs
  File "/home/frank/DiffRL/envs/__init__.py", line 8, in <module>                                                                                                                                        
    from envs.dflex_env import DFlexEnv                                                                                                                                                        
  File "/home/frank/DiffRL/envs/dflex_env.py", line 15, in <module>                                                                                                                              
    import dflex as df                                                                                                                                                                         
  File "/home/frank/DiffRL/dflex/dflex/__init__.py", line 15, in <module>                                                                                                                            
    kernel_init()                                                                                                                                                                              
  File "/home/frank/DiffRL/dflex/dflex/sim.py", line 67, in kernel_init                                                                                                                          
    kernels = df.compile()                                                                                                                                                                     
  File "/home/frank/DiffRL/dflex/dflex/adjoint.py", line 1865, in compile                                                                                                                        
    module = torch.utils.cpp_extension.load_inline('kernels',                                                                                                                                  
  File "/home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1293, in load_inline                                                                     
    return _jit_compile(                                                                                                                                                                       
  File "/home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile                                                                    
    _write_ninja_file_and_build_library(                                                                                                                                                       
  File "/home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(                                                                                                                                                                          
  File "/home/frank/anaconda3/envs/shac/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build                                                                
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'kernels'
HzfFrank commented 1 year ago

I solved it after switching to CUDA 11.7. Maybe this project doesn't support the latest CUDA versions. If anyone manages to run it on a recent CUDA version, I'd really appreciate it if you could share how.

wangrun20 commented 1 year ago

I'm hitting the same issue :( My GPU is an RTX 4090 with CUDA 12.1. I could not solve this problem :(

UltronAI commented 1 year ago

Similar error when running python -c "import dflex" after installation. RTX 4090 with cuda 11.6. Btw, I also failed to build dflex on A100.

Leon-LXA commented 1 year ago

After changing my cuda to 11.7, the problem still exists. RTX 3060 with cuda 11.7, pytorch 1.11.0

YuehChuan commented 11 months ago

Same issue here: Windows 11, CUDA 12.2, Python 3.8, torch 2.2.0.

shizhec commented 3 months ago

The issue is that this line assumes a minimum compute capability of 35: https://github.com/NVlabs/DiffRL/blob/a4c0dd1696d3c3b885ce85a3cb64370b580cb913/dflex/dflex/adjoint.py#L1860-L1861 However, since CUDA 12 the minimum supported architecture is 50: https://forums.developer.nvidia.com/t/nvcc-fatal-unsupported-gpu-architecture-compute-35/247815

I solved the issue after changing this line to:

cuda_flags = ['-gencode=arch=compute_86,code=compute_86']

I'm using CUDA 12.2 and PyTorch 2.3.1 with an RTX 3060 on Ubuntu 20.04 LTS.

I found this link helpful as well: https://stackoverflow.com/questions/68496906/pytorch-installation-for-different-cuda-architectures
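
If it helps, here is a small sketch (not code from the repo) that derives the flag from whatever GPU is actually present via torch.cuda.get_device_capability, instead of hard-coding an architecture:

import torch

# Query the compute capability of the first visible GPU, e.g. (8, 6) on an RTX 3060.
major, minor = torch.cuda.get_device_capability(0)
arch = f"{major}{minor}"

# Same shape as the hard-coded line above, but built from the detected architecture.
cuda_flags = [f"-gencode=arch=compute_{arch},code=compute_{arch}"]
print(cuda_flags)  # e.g. ['-gencode=arch=compute_86,code=compute_86']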

hdadong commented 2 weeks ago

I installed PyTorch using

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

on a CUDA 12.2, NVIDIA RTX 4090, Ubuntu 20.04 system.

Following @shizhec's suggestion, I checked the architectures supported by nvcc on my system:

(diff) bigeast@bigeast:~/DiffRL/examples$ nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90

so I changed cuda_flags to:

cuda_flags = ['-gencode=arch=compute_86,code=compute_86']

But I still get the following error:

(diff) bigeast@bigeast:~/DiffRL/examples$ python test_env.py --env AntEnv
Rebuilding kernels
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bigeast/DiffRL/dflex/dflex/kernels/build.ninja...
Building extension module kernels...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /home/bigeast/anaconda3/envs/diff/bin/nvcc  -DTORCH_EXTENSION_NAME=kernels -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/bigeast/DiffRL/dflex/dflex -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/TH -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/THC -isystem /home/bigeast/anaconda3/envs/diff/include -isystem /home/bigeast/anaconda3/envs/diff/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -gencode=arch=compute_86,code=sm_86 -std=c++14 -c /home/bigeast/DiffRL/dflex/dflex/kernels/cuda.cu -o cuda.cuda.o 
FAILED: cuda.cuda.o 
/home/bigeast/anaconda3/envs/diff/bin/nvcc  -DTORCH_EXTENSION_NAME=kernels -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/bigeast/DiffRL/dflex/dflex -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/TH -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/THC -isystem /home/bigeast/anaconda3/envs/diff/include -isystem /home/bigeast/anaconda3/envs/diff/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -gencode=arch=compute_86,code=sm_86 -std=c++14 -c /home/bigeast/DiffRL/dflex/dflex/kernels/cuda.cu -o cuda.cuda.o 
In file included from /usr/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=kernels -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/bigeast/DiffRL/dflex/dflex -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/TH -isystem /home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/include/THC -isystem /home/bigeast/anaconda3/envs/diff/include -isystem /home/bigeast/anaconda3/envs/diff/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -Z -O2 -DNDEBUG -c /home/bigeast/DiffRL/dflex/dflex/kernels/main.cpp -o main.o 
/home/bigeast/DiffRL/dflex/dflex/kernels/main.cpp: In function ‘df::float3 box_sdf_grad_cpu_func(df::float3, df::float3)’:
/home/bigeast/DiffRL/dflex/dflex/kernels/main.cpp:1051:47: warning: control reaches end of non-void function [-Wreturn-type]
 1051 |     var_58 = df::select(var_56, var_53, var_57);
      |                                               ^
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/home/bigeast/anaconda3/envs/diff/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_env.py", line 17, in <module>
    import envs
  File "/home/bigeast/DiffRL/envs/__init__.py", line 8, in <module>
    from envs.dflex_env import DFlexEnv
  File "/home/bigeast/DiffRL/envs/dflex_env.py", line 15, in <module>
    import dflex as df
  File "/home/bigeast/DiffRL/dflex/dflex/__init__.py", line 15, in <module>
    kernel_init()
  File "/home/bigeast/DiffRL/dflex/dflex/sim.py", line 67, in kernel_init
    kernels = df.compile()
  File "/home/bigeast/DiffRL/dflex/dflex/adjoint.py", line 1865, in compile
    module = torch.utils.cpp_extension.load_inline('kernels',
  File "/home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1433, in load_inline
    return _jit_compile(
  File "/home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'kernels'

Then I found that the error comes from the C++ compiler:

unsupported GNU version! gcc versions later than 8 are not supported!

This means that the gcc version installed on the system is newer than what the CUDA headers being picked up by nvcc support: they reject any gcc later than 8. You can check the current gcc version with:

gcc --version

1. Install gcc-8: First, install gcc-8 and g++-8:

sudo apt install gcc-8 g++-8

2. Switch gcc version:

After installation, you can switch to gcc-8 using update-alternatives to ensure the correct gcc version is used during compilation.

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 8
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 8

3. Confirm the switch:

Run the following commands to select the version of gcc you want to use:

sudo update-alternatives --config gcc
sudo update-alternatives --config g++

After selecting gcc-8, you can re-run your compilation commands.

4. Specify the version (if you don't want to change the system default):

If you don't want to change the system-wide default gcc, you can specify gcc-8 for the compilation process like this:

CC=/usr/bin/gcc-8 CXX=/usr/bin/g++-8 python test_env.py --env AntEnv

This ensures that gcc-8, which is supported by CUDA, is used for the compilation.
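
As a variant of step 4, here is a minimal sketch (my own, not part of the repo) that sets the same CC/CXX environment variables from Python before importing dflex, which triggers the kernel build on import. The gcc-8 paths are assumptions, and it relies on the build honoring these variables in the same way as the command-line override above:

import os

# Point the compilers at gcc-8 before dflex JIT-compiles its kernels on import.
# Adjust the paths to wherever gcc-8 / g++-8 are installed on your system.
os.environ["CC"] = "/usr/bin/gcc-8"
os.environ["CXX"] = "/usr/bin/g++-8"

import dflex as df  # kernel_init() -> df.compile() now runs with the overrides in place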

Finally, it ran successfully:

(diff) bigeast@bigeast:~/DiffRL/examples$ CC=/usr/bin/gcc-8 CXX=/usr/bin/g++-8 python test_env.py --env AntEnv
Using cached kernels
/home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/gym/envs/registration.py:307: DeprecationWarning: The package name gym_robotics has been deprecated in favor of gymnasium_robotics. Please uninstall gym_robotics and install gymnasium_robotics with `pip install gymnasium_robotics`. Future releases will be maintained under the new package name gymnasium_robotics.
  fn()
Setting seed: 0
/home/bigeast/anaconda3/envs/diff/lib/python3.8/site-packages/gym/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/home/bigeast/DiffRL/dflex/dflex/model.py:1687: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:230.)
  m.shape_transform = torch.tensor(transform_flatten_list(self.shape_transform), dtype=torch.float32, device=adapter)
fps =  20417.947956930817
mean reward =  1281.8564453125
Finish Successfully