ZiwenZhuang / parkour

[CoRL 2023] Robot Parkour Learning
https://robot-parkour.github.io
MIT License

Segmentation fault (core dumped) when distilling go2 policy #60

Open DaojiePENG opened 2 months ago

DaojiePENG commented 2 months ago

Hi~ I encountered Segmentation fault (core dumped) when distilling go2 policy. It terminated without starting the job.

root@ca96cdb478d7:/home/parkour_env/parkour/legged_gym# python3 legged_gym/scripts/train.py --headless --task go2_distill
Importing module 'gym_38' (/home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.10.0+cu113
Device count 2
/home/parkour_env/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
heightfield_raw data shape: 2320 5520 border size: 200
Segmentation fault (core dumped)
root@ca96cdb478d7:/home/parkour_env/parkour/legged_gym# nvidia-smi
Wed Sep 25 01:15:21 2024

I set multi_process_ to False and ran it in Docker on an A100, so I don't think this is caused by memory size. Any ideas about this problem? Thanks for any advice~

CoderWangcai commented 1 month ago

Did you solve this problem? I'm encountering the same issue.

CoderWangcai commented 1 month ago

I also run it inside a container. When I connect to the container via SSH and execute python legged_gym/scripts/train.py --headless --task go2_distill, I encounter the same segmentation fault as you. However, when I connect to the container using TurboVNC and execute the same command, there is no error.
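A possible explanation for the SSH-vs-TurboVNC difference (my assumption, not confirmed in this thread): a plain SSH session usually has no display attached, so Vulkan can only find the software lavapipe driver that the log warns about, while a TurboVNC session provides one. A quick way to check what the training process actually sees:

```python
import os

# Hedged sketch, not part of the parkour repo: report whether a display is
# visible to this process. Isaac Gym's renderer goes through Vulkan even in
# --headless mode when camera sensors are used, so the environment matters.
def has_display(environ=os.environ) -> bool:
    """True when an X11 or Wayland display is visible to this process."""
    return bool(environ.get("DISPLAY") or environ.get("WAYLAND_DISPLAY"))

if __name__ == "__main__":
    print("display visible:", has_display())
```

Running this both over SSH and inside the TurboVNC session should show whether the two environments really differ in this respect.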

ZiwenZhuang commented 1 month ago

It's probably due to the graphics driver, e.g. Vulkan.

Please make sure you can successfully run the scripts in isaacgym's python examples.
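Beyond running the isaacgym examples, one thing worth checking in a container is whether an NVIDIA Vulkan driver is visible at all. This is a hedged sketch based on how the Vulkan loader discovers drivers through ICD manifest files (general Vulkan behavior, not something stated in this thread): if only lavapipe is found, that matches the "lavapipe is not a conformant vulkan implementation" warning in the logs above.

```python
import glob
import json
import os

# Hedged sketch, not an official diagnostic: the Vulkan loader reads driver
# ("ICD") manifests from well-known directories. Inside a container that was
# not started with graphics capabilities, the NVIDIA manifest is often
# missing, leaving only lavapipe (software rendering).
def vulkan_icds(search_dirs=("/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d")):
    """Return (manifest_name, library_path) for every readable ICD manifest."""
    found = []
    for d in search_dirs:
        for path in sorted(glob.glob(os.path.join(d, "*.json"))):
            try:
                with open(path) as f:
                    lib = json.load(f).get("ICD", {}).get("library_path", "")
            except (OSError, json.JSONDecodeError):
                continue
            found.append((os.path.basename(path), lib))
    return found

def looks_software_only(icds) -> bool:
    """True when every discovered driver is lavapipe (libvulkan_lvp)."""
    return bool(icds) and all("lvp" in lib or "llvmpipe" in lib for _, lib in icds)

if __name__ == "__main__":
    icds = vulkan_icds()
    print("ICD manifests:", icds)
    print("software rendering only:", looks_software_only(icds))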

DaojiePENG commented 1 month ago

> Did you solve this problem? I'm encountering the same issue.

Unluckily, not yet.

> It's probably due to the graphics driver, e.g. Vulkan.
>
> Please make sure you can successfully run the scripts in isaacgym's python examples.

I can train go2_rough and go2_field with no problem. I also checked the graphics with `nvidia-smi` and it output the information normally:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   65C    P0   222W / 250W |  35504MiB / 40960MiB |     58%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   69C    P0   247W / 250W |  39496MiB / 40960MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    575975      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1576569      C   ...imba/anaconda3/bin/python    35492MiB |
|    0   N/A  N/A   3335579      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    575975      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   1570278      C   ...imba/anaconda3/bin/python    39484MiB |
|    1   N/A  N/A   3335579      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Moreover, I tried it on another server and met exactly the same problem.

root@0b633f36d09a:/home/parkour_env/parkour/legged_gym# python3 legged_gym/scripts/train.py --headless --task go2_distill
Importing module 'gym_38' (/home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.10.0+cu113
Device count 4
/home/parkour_env/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
heightfield_raw data shape: 2320 5520 border size: 200
Segmentation fault (core dumped)
root@0b633f36d09a:/home/parkour_env/parkour/legged_gym# nvidia-smi

The server's 'nvidia-smi' shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:31:00.0 Off |                  N/A |
| 30%   33C    P8    24W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 30%   31C    P8    21W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 30%   31C    P8    23W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:CA:00.0 Off |                  N/A |
| 30%   32C    P8    22W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Is it really a graphics-driver problem? Do I have to reinstall the driver, or try something else?

youwyu commented 1 month ago

I encountered some other problems, but I solved them by setting the first two parameters: self.sim = self.gym.create_sim(0, -1, gymapi.SIM_PHYSX, self.sim_params)
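For context on what that -1 does (based on Isaac Gym's documented create_sim(compute_device, graphics_device, ...) signature, not on this repo): passing -1 as the graphics device id skips creating a rendering context entirely, so a broken Vulkan setup is never touched. The trade-off is that camera sensors, which the distillation tasks need, require a real graphics device. A small helper sketching that choice:

```python
# Hedged sketch of the device-id choice, with a hypothetical helper name:
# gym.create_sim(compute_device_id, graphics_device_id, ...) accepts -1 for
# graphics_device_id, which disables the rendering pipeline (no Vulkan
# context). That avoids driver-related crashes, but camera sensors will not
# work, so it is only viable for tasks that do not render depth images.
def pick_graphics_device(sim_device_id: int, need_cameras: bool) -> int:
    """Choose the second argument to gym.create_sim()."""
    if not need_cameras:
        return -1  # no graphics context is created at all
    return sim_device_id  # render on the same GPU that runs physics
```

That would explain why this workaround helps for some tasks but is unlikely to fix go2_distill, which depends on simulated depth cameras.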

D-jojo commented 1 month ago

@DaojiePENG Hi, have you solved this problem yet? I have also encountered the same problem.

root@8431ca414288:~/parkour/legged_gym# python legged_gym/scripts/collect.py --headless --task a1_distill
Importing module 'gym_38' (/root/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /root/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 2.1.0+cu121
Device count 1
/root/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu121/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Segmentation fault (core dumped)

May I ask how to solve it? Thank you!

DaojiePENG commented 1 month ago

> @DaojiePENG Hi, have you solved this problem yet? I have also encountered the same problem. […] May I ask how to solve it? Thank you!

Not yet. I tried ensuring self.graphics_device_id = self.sim_device_id in create_sim(), but it didn't work; the result was the same as before.

D-jojo commented 1 month ago

@DaojiePENG I'm sorry to hear that. I will continue to try. We can communicate and share our experiences anytime.