DaojiePENG opened this issue 2 months ago
Did you solve this problem? I'm encountering the same issue.
My setup runs within a container. When I connect to the container via SSH and execute the command python legged_gym/scripts/train.py --headless --task go2_distill
, I encounter the same segmentation fault as you. However, when I connect to the container using TurboVNC and execute the same command, there is no error.
It is probably due to the graphics driver, e.g. Vulkan.
Please make sure you can successfully run the scripts in isaacgym's Python examples.
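If those fail too, a minimal smoke test like the sketch below (assuming a standard isaacgym install; the GPU ids are illustrative) can isolate whether create_sim itself crashes:

# Minimal Isaac Gym smoke test: create a headless PhysX sim on GPU 0.
# If this also segfaults, the problem is in the driver/Vulkan stack,
# not in the legged_gym training code.
from isaacgym import gymapi

gym = gymapi.acquire_gym()
sim_params = gymapi.SimParams()
sim_params.use_gpu_pipeline = True
sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, sim_params)  # compute dev 0, graphics dev 0
assert sim is not None, "create_sim failed"
print("sim created OK")
gym.destroy_sim(sim)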
Did you solve this problem? I'm encountering the same issue.
Unfortunately, not yet.
It is probably due to the graphics driver, e.g. Vulkan.
Please make sure you can successfully run the scripts in isaacgym's Python examples.
I can train the go2_rough and go2_field tasks with no problem. I also checked the GPUs with 'nvidia-smi' and it output the information normally:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:31:00.0 Off | 0 |
| N/A 65C P0 222W / 250W | 35504MiB / 40960MiB | 58% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... Off | 00000000:B1:00.0 Off | 0 |
| N/A 69C P0 247W / 250W | 39496MiB / 40960MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 575975 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 1576569 C ...imba/anaconda3/bin/python 35492MiB |
| 0 N/A N/A 3335579 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 575975 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1570278 C ...imba/anaconda3/bin/python 39484MiB |
| 1 N/A N/A 3335579 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
Moreover, I tried it on another server and hit exactly the same problem.
root@0b633f36d09a:/home/parkour_env/parkour/legged_gym# python3 legged_gym/scripts/train.py --headless --task go2_distill
Importing module 'gym_38' (/home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.10.0+cu113
Device count 4
/home/parkour_env/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
heightfield_raw data shape: 2320 5520 border size: 200
Segmentation fault (core dumped)
root@0b633f36d09a:/home/parkour_env/parkour/legged_gym# nvidia-smi
The server's 'nvidia-smi' shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:31:00.0 Off | N/A |
| 30% 33C P8 24W / 350W | 5MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:4B:00.0 Off | N/A |
| 30% 31C P8 21W / 350W | 5MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:B1:00.0 Off | N/A |
| 30% 31C P8 23W / 350W | 5MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:CA:00.0 Off | N/A |
| 30% 32C P8 22W / 350W | 5MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Is it really a graphics driver problem? Do I have to reinstall the driver, or should I try something else?
I encountered some other problems, but I solved them by setting the first two params to
self.sim = self.gym.create_sim(0, -1, gymapi.SIM_PHYSX, self.sim_params)
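If I read the Isaac Gym API right, those two parameters are the compute device and the graphics device, and passing -1 as the graphics device skips creating a Vulkan rendering context entirely, which is why it can sidestep driver issues. A commented version of the same call (a sketch; camera sensors will not work without a graphics device):

# create_sim(compute_device, graphics_device, physics_engine, sim_params)
self.sim = self.gym.create_sim(
    0,    # compute device id: run PhysX on GPU 0
    -1,   # graphics device id: -1 disables rendering, so no Vulkan driver is needed
    gymapi.SIM_PHYSX,
    self.sim_params,
)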
@DaojiePENG Hi,
Have you solved this problem yet? I have also encountered the same problem.
root@8431ca414288:~/parkour/legged_gym# python legged_gym/scripts/collect.py --headless --task a1_distill
Importing module 'gym_38' (/root/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /root/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 2.1.0+cu121
Device count 1
/root/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu121/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Segmentation fault (core dumped)
May I ask how to solve it?
Thank you!
Not yet. I have tried to ensure self.graphics_device_id = self.sim_device_id
in create_sim(), but it didn't work; the result was the same as before.
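For reference, this is roughly what I changed (a sketch following legged_gym's create_sim call; attribute names as in the upstream code):

# Attempted fix (did not help in my case): force the graphics device
# to be the same GPU as the compute device before creating the sim.
self.graphics_device_id = self.sim_device_id  # e.g. both cuda:0 -> id 0
self.sim = self.gym.create_sim(
    self.sim_device_id,       # compute device (PhysX)
    self.graphics_device_id,  # graphics device (Vulkan renderer)
    gymapi.SIM_PHYSX,
    self.sim_params,
)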
@DaojiePENG I'm sorry to hear that. I will continue to try. We can communicate and share our experiences anytime.
Hi~ I encountered
Segmentation fault (core dumped)
when distilling the go2 policy. It terminated without starting the job:
root@ca96cdb478d7:/home/parkour_env/parkour/legged_gym# python3 legged_gym/scripts/train.py --headless --task go2_distill
Importing module 'gym_38' (/home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/parkour_env/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.10.0+cu113
Device count 2
/home/parkour_env/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Using LeggedRobotField.__init__, num_obs and num_privileged_obs will be computed instead of assigned.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
heightfield_raw data shape: 2320 5520 border size: 200
Segmentation fault (core dumped)
root@ca96cdb478d7:/home/parkour_env/parkour/legged_gym# nvidia-smi
Wed Sep 25 01:15:21 2024
I configured multi_process_ to False and ran it in Docker on an A100. I don't think this is because of the memory size. Any ideas about this problem? Thanks for any advice~
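Given the lavapipe warning right before the crash, my guess is that the container has no NVIDIA Vulkan ICD, so Vulkan falls back to software rendering. A quick check (a sketch; the ICD paths are the common defaults and may differ on your image):

# Check whether an NVIDIA Vulkan ICD is visible inside the container.
# If only lavapipe's ICD is present, Isaac Gym renders in software;
# exposing the NVIDIA ICD (e.g. by running the container with
# NVIDIA_DRIVER_CAPABILITIES=all) may be what a TurboVNC session gets right.
import glob
import os

icd_dirs = ["/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d"]
icds = [p for d in icd_dirs for p in glob.glob(os.path.join(d, "*.json"))]
print("Vulkan ICDs found:", icds if icds else "none")
if not any("nvidia" in os.path.basename(p).lower() for p in icds):
    print("No NVIDIA ICD -> Vulkan will fall back to lavapipe (software).")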