ZiwenZhuang / parkour

[CoRL 2023] Robot Parkour Learning
https://robot-parkour.github.io
MIT License

Segmentation fault (core dumped) when distilling go2 policy #60

Open DaojiePENG opened 2 weeks ago

DaojiePENG commented 2 weeks ago

Hi~ I encountered a Segmentation fault (core dumped) when distilling the go2 policy. The program terminated before training even started.

root@ca96cdb478d7:/home/parkour_env/parkour/legged_gym# python3 legged_gym/scripts/train.py --headless --task go2_distill
Importing module 'gym_38' (/home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.10.0+cu113
Device count 2
/home/parkour_env/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
heightfield_raw data shape: 2320 5520 border size: 200
Segmentation fault (core dumped)
root@ca96cdb478d7:/home/parkour_env/parkour/legged_gym# nvidia-smi
Wed Sep 25 01:15:21 2024

I set multi_process_ to False and ran it in Docker on an A100. I don't think this is caused by memory size. Any ideas about this problem? Thanks for any advice~
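
One detail that stands out in the log: the line "WARNING: lavapipe is not a conformant vulkan implementation" means the Vulkan loader picked Mesa's software rasterizer (lavapipe) instead of the NVIDIA driver, and the segfault follows right after, apparently while the terrain graphics are being created. A quick sketch to list which Vulkan ICDs the container actually sees (assuming the usual Ubuntu manifest locations; adjust the paths for your image):

import glob, json

# If only lvp_icd*.json (lavapipe) shows up and nvidia_icd.json is absent,
# rendering falls back to software and Isaac Gym can segfault.
for path in glob.glob("/usr/share/vulkan/icd.d/*.json") + glob.glob("/etc/vulkan/icd.d/*.json"):
    with open(path) as f:
        print(path, "->", json.load(f).get("ICD", {}).get("library_path"))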

CoderWangcai commented 4 days ago

Did you solve this problem? I'm encountering the same issue.

CoderWangcai commented 3 days ago

I also run it inside a container. When I connect to the container via SSH and execute python legged_gym/scripts/train.py --headless --task go2_distill, I hit the same segmentation fault as you. However, when I connect to the container using TurboVNC and execute the same command, there is no error.
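
Since the only difference is the session type, comparing the environment of the two sessions might narrow it down. A small sketch (these are standard X11/Vulkan/container variables, nothing specific to this repo):

import os

# Run this in both the SSH session and the TurboVNC session and diff the
# output; a missing DISPLAY or a Vulkan ICD override is a likely culprit.
for var in ("DISPLAY", "VK_ICD_FILENAMES", "VK_DRIVER_FILES",
            "NVIDIA_DRIVER_CAPABILITIES", "LD_LIBRARY_PATH"):
    print(var, "=", os.environ.get(var, "<unset>"))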

ZiwenZhuang commented 3 days ago

It's probably due to the graphics driver, e.g. Vulkan.

Please make sure you can successfully run the scripts in isaacgym's python examples.
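
If the basic examples pass, a camera-sensor test is the closer match: go2_rough and go2_field train without cameras, but the distill task creates depth camera sensors, so a broken Vulkan setup would only crash distillation. A minimal sketch using the standard isaacgym preview API (not a script from this repo):

from isaacgym import gymapi

gym = gymapi.acquire_gym()

# The graphics device (second argument) must be a real GPU, not -1,
# otherwise camera sensors cannot be created.
sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, gymapi.SimParams())
env = gym.create_env(sim, gymapi.Vec3(-1.0, -1.0, 0.0), gymapi.Vec3(1.0, 1.0, 1.0), 1)

props = gymapi.CameraProperties()
props.width, props.height = 128, 128
cam = gym.create_camera_sensor(env, props)

gym.simulate(sim)
gym.fetch_results(sim, True)
gym.step_graphics(sim)
gym.render_all_camera_sensors(sim)  # exercises the Vulkan render path
print("camera sensor OK, handle:", cam)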

DaojiePENG commented 1 day ago

> Did you solve this problem? I'm encountering the same issue.

Unluckily, not yet.

> It's probably due to the graphics driver, e.g. Vulkan.
>
> Please make sure you can successfully run the scripts in isaacgym's python examples.

I can train go2_rough and go2_field with no problem. I also checked the graphics with 'nvidia-smi' and it output the information normally:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   65C    P0   222W / 250W |  35504MiB / 40960MiB |     58%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   69C    P0   247W / 250W |  39496MiB / 40960MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    575975      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1576569      C   ...imba/anaconda3/bin/python    35492MiB |
|    0   N/A  N/A   3335579      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    575975      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   1570278      C   ...imba/anaconda3/bin/python    39484MiB |
|    1   N/A  N/A   3335579      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Moreover, I tried it on another server, whose GPUs are completely idle, and hit exactly the same problem.

root@0b633f36d09a:/home/parkour_env/parkour/legged_gym# python3 legged_gym/scripts/train.py --headless --task go2_distill
Importing module 'gym_38' (/home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.10.0+cu113
Device count 4
/home/parkour_env/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
heightfield_raw data shape: 2320 5520 border size: 200
Segmentation fault (core dumped)
root@0b633f36d09a:/home/parkour_env/parkour/legged_gym# nvidia-smi

That server's 'nvidia-smi' shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:31:00.0 Off |                  N/A |
| 30%   33C    P8    24W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 30%   31C    P8    21W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 30%   31C    P8    23W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:CA:00.0 Off |                  N/A |
| 30%   32C    P8    22W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Is this really a graphics driver problem? Do I have to reinstall the driver, or should I try something else?