PKU-MARL / DexterousHands

This is a library that provides dual dexterous hand manipulation tasks through Isaac Gym
https://pku-marl.github.io/DexterousHands/
Apache License 2.0

Segmentation fault (core dumped) in Docker #8

Closed JensenLZX closed 2 years ago

JensenLZX commented 2 years ago

Segmentation fault (core dumped) in Docker

Device: NVIDIA A100 40GB PCIe GPU Accelerator

Method: Docker

Details:

I ran

python train.py --task=ShadowHandOver --algo=ppo

and

python train.py --task=ShadowHandOver --algo=happo

in ~\bi-dexhands.

For both commands, the model weights (xxx.pt) were saved correctly in ~\bi-dexhands\logs.

However, at the end of each run, the console shows the following error.

Output:

some episodes done, average rewards:  tensor(16.7454, device='cuda:0')
some episodes done, average rewards:  tensor(14.1145, device='cuda:0')
some episodes done, average rewards:  tensor(15.4696, device='cuda:0')
some episodes done, average rewards:  tensor(15.4252, device='cuda:0')
some episodes done, average rewards:  tensor(14.8325, device='cuda:0')
some episodes done, average rewards:  tensor(19.7192, device='cuda:0')
some episodes done, average rewards:  tensor(15.9727, device='cuda:0')

Algo happo Exp check updates 48825/48828 episodes, total num timesteps 49997824/50000000, FPS 1922.

some episodes done, average rewards:  tensor(14.0804, device='cuda:0')
some episodes done, average rewards:  tensor(17.5084, device='cuda:0')
some episodes done, average rewards:  tensor(18.6891, device='cuda:0')
Segmentation fault (core dumped)

Is there any suggestion about dealing with this error?

Thx in advance!

cypypccpy commented 2 years ago

Dear @RogerLZX ,

I'm sorry, but because we rarely use Docker to run Isaac Gym, I don't know the cause of this bug. It seems to appear only at the end of the task, so maybe you can increase the number of episodes to achieve the same effect.

Isaac Gym is still in development, so bugs like this are inevitable. I recommend searching for or asking about this bug on the DevTalk Forum; NVIDIA developers usually answer there if they know the cause.

Hope this can help you.

song-hl commented 2 years ago

@RogerLZX You can use faulthandler to locate your problem; here is a tutorial: faulthandler. My problem is as follows:

  File "/workspaces/test_docker/DexterousHands/bi-dexhands/tasks/shadow_hand_over.py", line 234 in create_sim
  File "/workspaces/test_docker/DexterousHands/bi-dexhands/tasks/hand_base/base_task.py", line 82 in __init__
  File "/workspaces/test_docker/DexterousHands/bi-dexhands/tasks/shadow_hand_over.py", line 170 in __init__
  File "/workspaces/test_docker/DexterousHands/bi-dexhands/utils/parse_task.py", line 85 in parse_task
  File "train.py", line 43 in train
  File "train.py", line 99 in <module>

Since the Docker container doesn't have a graphical viewer, we need to set the "headless" parameter to True by running train.py with --headless, like

python train.py --task=ShadowHandOver --algo=ppo --test --headless

This works for me.
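For anyone who wants to reproduce a traceback like the one above, here is a minimal sketch of enabling Python's standard-library faulthandler module. The helper name `enable_crash_traceback` is hypothetical and not part of this repository; the assumption is that you would call it at the very top of your entry script (for example train.py), before the simulator is created.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Install faulthandler so a fatal signal (e.g. SIGSEGV) dumps the Python traceback.

    `stream` must be a file object with a real file descriptor; it defaults
    to sys.stderr. The file has to stay open for as long as the handler is
    enabled. (Hypothetical helper, for illustration only.)
    """
    if stream is None:
        stream = sys.stderr
    # all_threads=True prints the stack of every Python thread, not only the
    # thread that received the signal.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Intended usage, at the top of the entry script:
#     enable_crash_traceback()
```

The same effect should be available without editing any code by starting the interpreter with `python -X faulthandler train.py ...` or by setting the `PYTHONFAULTHANDLER` environment variable inside the container.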

JensenLZX commented 2 years ago

@cypypccpy Thanks for your reply. @ustchlsong's answer helps me out.

@ustchlsong Thanks for your reply. It helps a lot! And the tool you've recommended, faulthandler, is really useful! Thanks~