eleurent / rl-agents

Implementations of Reinforcement Learning and Planning algorithms
MIT License

'Segmentation fault' when testing the env_medium environment with DQNAgent/1_step #32

Closed: zhaoworking closed this issue 4 years ago

zhaoworking commented 4 years ago

`(gymlab) root@iZ8vbhynnqk42im5ymgijyZ:~/rl-agents/scripts# python3 experiments.py evaluate configs/HighwayEnv/env_medium.json configs/HighwayEnv/agents/DQNAgent/1_step.json --train --episodes=1000
pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
INFO: Making new env: highway-v0
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), -1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), 1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/gym/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
[ERROR] Preferred device cuda:best unavailable, switching to default cpu
INFO: Creating monitor directory out/HighwayEnv/DQNAgent/run_20200405-230154_7882
profiler execution failed
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
INFO: Starting new video recorder writing to /root/rl-agents/scripts/out/HighwayEnv/DQNAgent/run_20200405-230154_7882/openaigym.video.0.7882.video000000.mp4
Segmentation fault`

When I was testing env_medium with DQN, I got this fault. Note that I was running it over SSH, and on the CPU rather than CUDA. Can you help me?

eleurent commented 4 years ago

[ERROR] Preferred device cuda:best unavailable, switching to default cpu

This should be changed to a warning; it is not the cause of the segfault.

The error seems to be related to the ALSA lib for sound devices, which is called by pygame at initialization. Are you trying to run this code on a server with no display? If so, try running experiments.py with the --no-display option, which will prevent calls to pygame.

zhaoworking commented 4 years ago

pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
INFO: Making new env: highway-v0
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), -1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), 1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/gym/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
[ERROR] Preferred device cuda:best unavailable, switching to default cpu
INFO: Creating monitor directory out/HighwayEnv/DQNAgent/run_20200406-174819_9262
profiler execution failed
Segmentation fault

Following your suggestion, I ran the command python3 experiments.py evaluate configs/HighwayEnv/env_easy.json configs/HighwayEnv/agents/DQNAgent/1_step.json --train --episodes=1000 --no-display, and I get the same segmentation fault as above. I also still see a similar message, INFO: Creating monitor directory out/HighwayEnv/DQNAgent/run_20200406-174819_9262. Is the command I used wrong, or is it something else?

eleurent commented 4 years ago

The command that you used is right, so I guess the problem is not related to rendering.

I could not reproduce the issue on my computer:

python3 experiments.py evaluate configs/HighwayEnv/env_easy.json configs/HighwayEnv/agents/DQNAgent/1_step.json --train --episodes=1000 --no-display
pygame 1.9.4
Hello from the pygame community. https://www.pygame.org/contribute.html
INFO: Making new env: highway-v0
[INFO] Choosing GPU device: 0, memory used: 1563 
INFO: Creating monitor directory out\HighwayEnv\DQNAgent\run_20200406-125052_9276
C:\Anaconda3\lib\site-packages\torch\onnx\utils.py:501: UserWarning: ONNX export failed on ATen operator reshape because torch.onnx.symbolic.reshape does not exist
  .format(op_name, op_name))
[INFO] Episode 0 score: 3.5 
[INFO] Episode 1 score: 11.9 
[INFO] Episode 2 score: 2.6 
[INFO] Episode 3 score: 5.8 
[INFO] Episode 4 score: 3.3 
[INFO] Episode 5 score: 15.5 
[INFO] Episode 6 score: 15.1 
[INFO] Episode 7 score: 3.2 
[INFO] Episode 8 score: 17.2 
[INFO] Episode 9 score: 4.4 
[INFO] Episode 10 score: 4.7 
[INFO] Episode 11 score: 7.5 
[INFO] Episode 12 score: 4.9 
[INFO] Episode 13 score: 6.5 
[INFO] Episode 14 score: 14.5 

Do you have this issue only with the 1_step.json configuration?

zhaoworking commented 4 years ago

Unfortunately, I find that all of the agents run into this issue. Is it related to my computer? I also can't see any [INFO] Episode x score: x lines in my Xshell.

eleurent commented 4 years ago

It is probably related to your computer, since the automatic tests are passing.

But I really wonder what could cause such a segmentation fault...

Could you try using an IDE like PyCharm and stepping through the code with a debugger, to see exactly where it crashes?
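A lighter-weight alternative to stepping through in an IDE is Python's standard faulthandler module, which dumps the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. The following is a minimal sketch, assuming it is placed near the top of experiments.py; the helper name enable_crash_traceback is hypothetical, and this has not been run against the setup in this issue:

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Install handlers that dump the Python traceback of every thread
    when the process receives a fatal signal (SIGSEGV, SIGABRT, ...).

    `stream` must be a real file object with a file descriptor;
    it defaults to the current sys.stderr.
    """
    if not faulthandler.is_enabled():
        target = stream if stream is not None else sys.stderr
        faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage, before any environment or agent is created:
# enable_crash_traceback()
```

The same handler can also be switched on without editing the script, either by passing the -X faulthandler option to python3 or by setting the PYTHONFAULTHANDLER environment variable. The traceback only shows the Python frames that were active at the time of the crash, so it narrows down the call that triggered the segfault rather than explaining the native failure itself.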

eleurent commented 4 years ago

Do you also get a segmentation fault with the cartpole environment, for example, or only with highway-env?

eleurent commented 4 years ago

Also, could you try adding these lines at the top of experiments.py (after the other imports)?

import os
os.environ['SDL_AUDIODRIVER'] = 'dsp'

zhaoworking commented 4 years ago

[ERROR] Preferred device cuda:best unavailable, switching to default cpu
INFO: Creating monitor directory out/HighwayEnv/DQNAgent/run_20200409-203051_12866
profiler execution failed
INFO: Starting new video recorder writing to /root/rl-agents/scripts/out/HighwayEnv/DQNAgent/run_20200409-203051_12866/openaigym.video.0.12866.video000000.mp4
Segmentation fault

I added the code to experiments.py, but the result is the same. The day before yesterday, I found that it works well in my VM, which runs a desktop edition of the OS: there I could see the video and the [INFO] messages without any error. So I guess the issue is most likely related to my remote Linux server, even though I don't know what it is.

eleurent commented 4 years ago

As I mentioned, the ALSA lib responsible for the segfault is used by pygame for audio management, and it crashes when it cannot find audio drivers (on your Linux server), probably when pygame is initialised through pygame.init(). However, pygame is only used for rendering and should not be initialised when the --no-display option is used...
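For reference, a common way to keep SDL (the library underneath pygame) away from real audio and video hardware on a headless machine is to select its dummy drivers through the SDL_AUDIODRIVER and SDL_VIDEODRIVER environment variables. A minimal sketch, assuming the variables are set before pygame is initialised; the helper name configure_headless_sdl is hypothetical, and it is not confirmed that this resolves the crash reported here:

```python
import os


def configure_headless_sdl(environ=os.environ):
    """Select SDL's dummy audio and video drivers, unless the caller
    has already chosen a driver explicitly."""
    environ.setdefault("SDL_AUDIODRIVER", "dummy")
    environ.setdefault("SDL_VIDEODRIVER", "dummy")
    return environ


# Hypothetical usage, before pygame.init() or any rendering call:
# configure_headless_sdl()
```

Using setdefault rather than plain assignment leaves any driver already chosen in the shell untouched, which makes it easy to compare the dummy driver with another value such as dsp.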

zhaoworking commented 4 years ago

(gymlab) root@iZ8vbhynnqk42im5ymgijyZ:~/rl-agents/scripts# python experiments.py evaluate configs/HighwayEnv/env_easy.json configs/HighwayEnv/agents/DQNAgent/baseline.json  --train --episodes=1000 --no-display
pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
INFO: Making new env: highway-v0
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), -1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), 1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/gym/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
[ERROR] Preferred device cuda:best unavailable, switching to default cpu 
INFO: Creating monitor directory out/HighwayEnv/DQNAgent/run_20200409-212448_12995
profiler execution failed
Segmentation fault

Thanks for your patient answers. But when I added the --no-display option, the same issue still occurred.

eleurent commented 4 years ago

Yes I know, which is why I am a bit clueless about what is going on here. Actually, the message INFO: Starting new video recorder writing to /root/rl-agents/scripts/out/HighwayEnv/DQNAgent/run_20200405-230154_7882/openaigym.video.0.7882.video000000.mp4 shows that the gym monitor tried to record the video, which should not happen with the no-display option, and causes the segfault.

Can you add a print statement here to check that video_callable is set to False?

zhaoworking commented 4 years ago

pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
INFO: Making new env: highway-v0
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), -1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full((5, 5), 1) will return an array of dtype('int64')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
/root/gym/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
[ERROR] Preferred device cuda:best unavailable, switching to default cpu 
INFO: Creating monitor directory out/HighwayEnv/DQNAgent/run_20200409-215218_13062
profiler execution failed
Segmentation fault

I added the statement print('video_callable:',video_callable) just above that line, but neither True nor False is printed on my computer, and I get the same fault.

eleurent commented 4 years ago

This is strange; I don't know what to make of it. This print statement should run before the "INFO: Creating monitor directory out/HighwayEnv/DQNAgent/run_20200409-215218_13062" message. I think the best way to solve this is to use a debugger and breakpoints to track down exactly which line causes the segfault.
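When a print never appears before a hard crash, two things may be worth ruling out first: that the output was still sitting in a buffer when the process died (more likely if stdout is piped or redirected than in an interactive terminal), and that the edited file is not the copy Python actually imports. A minimal sketch, assuming a hypothetical helper named checkpoint; neither explanation is confirmed for this issue:

```python
import sys


def checkpoint(label, value):
    """Write a debug line to stderr and flush it immediately, so the
    message is not left in a buffer if the process crashes right after."""
    print(f"[checkpoint] {label} = {value!r}", file=sys.stderr, flush=True)


# Hypothetical usage next to the code under suspicion:
# checkpoint("video_callable", video_callable)
# checkpoint("module file", some_module.__file__)  # some_module is a placeholder
```

Printing a module's __file__ attribute shows which copy of the source is loaded; if the path points into site-packages rather than the edited checkout, changes made in the checkout would never run.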