Open Skylark0924 opened 4 years ago
I think the error might be that dm_control is installed with the -e flag. Can you try reinstalling without it and see if things start working?
I'm pretty sure this has nothing to do with the MuJoCo version.
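One rough way to check whether dm_control is being picked up from an editable (-e) checkout is to look at where Python imports it from. The sketch below is only a heuristic and is not part of softlearning or dm_control: the helper name describe_import is made up for illustration, and it assumes that a regular install lives under a site-packages or dist-packages directory while an -e install points back at the source checkout.

```python
import importlib.util


def describe_import(module_name):
    """Return (origin, looks_installed) for a top-level module.

    Returns None if the module cannot be found. looks_installed is a
    heuristic (an assumption, not a pip API): True when the import path
    sits under site-packages or dist-packages.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    origin = spec.origin or ""
    # Namespace packages have no single origin file; fall back to their first search path.
    if not origin and spec.submodule_search_locations:
        origin = list(spec.submodule_search_locations)[0]
    normalized = origin.replace("\\", "/")
    looks_installed = (
        "/site-packages/" in normalized or "/dist-packages/" in normalized
    )
    return origin, looks_installed


if __name__ == "__main__":
    # Prints None when dm_control is not importable in the current environment.
    print(describe_import("dm_control"))
```

If the reported path points into a cloned repository rather than the environment's site-packages, the package is most likely still installed with -e.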
Thank you very much for responding to me so quickly! What you said is indeed the solution. Now I can run the code, but another error appears: all of the Ray workers died or were killed.
(pid=20636) 2019-12-18 11:54:58.253453: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
(pid=20636) 2019-12-18 11:54:58.263643: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3500000000 Hz
(pid=20636) 2019-12-18 11:54:58.265115: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55585ca51290 executing computations on platform Host. Devices:
(pid=20636) 2019-12-18 11:54:58.265164: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
(pid=20636) Using seed 8695
(pid=20636) Fatal Python error: Segmentation fault
(pid=20636)
(pid=20636) Stack (most recent call first):
(pid=20636) File "/home/lab/Github/multiworld-master/multiworld/envs/mujoco/mujoco_env.py", line 152 in initialize_camera
(pid=20636) File "/home/lab/Github/multiworld-master/multiworld/core/image_env.py", line 75 in __init__
(pid=20636) File "/home/lab/Github/multiworld-master/multiworld/envs/mujoco/__init__.py", line 466 in create_image_48_sawyer_door_pull_hook_v0
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 86 in make
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 125 in make
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 183 in make
(pid=20636) File "/home/lab/Github/reward-learning-rl/softlearning/environments/utils.py", line 48 in get_goal_example_environment_from_variant
(pid=20636) File "/home/lab/Github/reward-learning-rl/examples/classifier_rl/main.py", line 30 in _build
(pid=20636) File "/home/lab/Github/reward-learning-rl/examples/development/main.py", line 77 in _train
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/trainable.py", line 151 in train
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/function_manager.py", line 783 in actor_method_executor
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 887 in _process_task
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 990 in _wait_for_and_process_task
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 1039 in main_loop
(pid=20636) File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/workers/default_worker.py", line 98 in <module>
2019-12-18 11:55:00,107 ERROR trial_runner.py:494 -- Error processing event.
Traceback (most recent call last):
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 443, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 315, in fetch_result
result = ray.get(trial_future[0])
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 2193, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2019-12-18 11:55:00,109 INFO ray_trial_executor.py:179 -- Destroying actor for trial 4a399b7f-algorithm=VICERAQ-seed=8695. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-12-18 11:55:00,110 ERROR worker.py:1672 -- A worker died or was killed while executing task 000000002ceace92ce444f4ed49ec6617ee3c70c.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/2 GPUs
Memory usage on this node: 7.6/67.2 GB
Result logdir: /home/lab/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-12-18T11-54-39-2019-12-18T11-54-39
Number of trials: 5 ({'ERROR': 5})
ERROR trials:
- 337b24bd-algorithm=VICERAQ-seed=5170: ERROR, 1 failures: /home/lab/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-12-18T11-54-39-2019-12-18T11-54-39/337b24bd-algorithm=VICERAQ-seed=5170_2019-12-18_11-54-39o43g6jsz/error_2019-12-18_11-54-44.txt
- f0bcf517-algorithm=VICERAQ-seed=6842: ERROR, 1 failures: /home/lab/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-12-18T11-54-39-2019-12-18T11-54-39/f0bcf517-algorithm=VICERAQ-seed=6842_2019-12-18_11-54-44ccuk2jv5/error_2019-12-18_11-54-48.txt
- 51db80cc-algorithm=VICERAQ-seed=6234: ERROR, 1 failures: /home/lab/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-12-18T11-54-39-2019-12-18T11-54-39/51db80cc-algorithm=VICERAQ-seed=6234_2019-12-18_11-54-48rwrncgft/error_2019-12-18_11-54-52.txt
- f3f442e9-algorithm=VICERAQ-seed=8672: ERROR, 1 failures: /home/lab/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-12-18T11-54-39-2019-12-18T11-54-39/f3f442e9-algorithm=VICERAQ-seed=8672_2019-12-18_11-54-52jx2r68m6/error_2019-12-18_11-54-56.txt
- 4a399b7f-algorithm=VICERAQ-seed=8695: ERROR, 1 failures: /home/lab/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-12-18T11-54-39-2019-12-18T11-54-39/4a399b7f-algorithm=VICERAQ-seed=8695_2019-12-18_11-54-563zft1ykk/error_2019-12-18_11-55-00.txt
Traceback (most recent call last):
File "/home/lab/anaconda3/envs/softlearning/bin/softlearning", line 11, in <module>
load_entry_point('softlearning', 'console_scripts', 'softlearning')()
File "/home/lab/Github/reward-learning-rl/softlearning/scripts/console_scripts.py", line 202, in main
return cli()
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/lab/Github/reward-learning-rl/softlearning/scripts/console_scripts.py", line 71, in run_example_local_cmd
return run_example_local(example_module_name, example_argv)
File "/home/lab/Github/reward-learning-rl/examples/instrument.py", line 228, in run_example_local
reuse_actors=True)
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/tune.py", line 253, in run
raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [337b24bd-algorithm=VICERAQ-seed=5170, f0bcf517-algorithm=VICERAQ-seed=6842, 51db80cc-algorithm=VICERAQ-seed=6234, f3f442e9-algorithm=VICERAQ-seed=8672, 4a399b7f-algorithm=VICERAQ-seed=8695])
So sorry to bother you so much!
The main reason for the failure is this line:
(pid=20636) Fatal Python error: Segmentation fault
It's unclear where that comes from. Could you try setting a breakpoint (either breakpoint() or import pdb; pdb.set_trace()) somewhere in the beginning of the main function, running the code with the sequential debug mode (softlearning run_example_debug ...), and then stepping through the code to see where it crashes? It's easier to give advice once we know where in the code the issue actually happens.
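Note that the tracebacks above come from Python 3.6, where the built-in breakpoint() is not available (it was added in Python 3.7), so import pdb; pdb.set_trace() is the form to use there. Because a segmentation fault kills the whole interpreter, it can also help to run only the suspected step in a child process with faulthandler enabled, so the parent survives and can report how the child died. A minimal sketch, assuming a POSIX system; run_isolated and describe_exit are hypothetical helpers, not part of softlearning or Ray:

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=120):
    """Run `snippet` in a fresh interpreter with faulthandler enabled.

    Returns (returncode, stderr). On POSIX, a negative return code means
    the child was killed by that signal number.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable message."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)
```

For example, passing a snippet that only imports gym and multiworld and creates Image48SawyerDoorPullHookEnv-v0 would show whether environment creation alone reproduces the crash reported in initialize_camera, independently of Tune.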
I found that it is an error caused by ray.tune
/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/experiment.py(115)__init__()->None
-> self.spec = spec
(Pdb) r
> /home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/tune.py(200)run()
-> checkpoint_dir = _find_checkpoint_dir(experiment)
(Pdb) r
2019-12-18 21:24:28,480 INFO tune.py:64 -- Did not find checkpoint file in /home/lab/ray_results/multiworld/mujoco/Image48SawyerDoorPullHookEnv-v0/2019-12-18T21-22-34-2019-12-18T21-22-33.
2019-12-18 21:24:28,481 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/2 GPUs
Memory usage on this node: 8.3/67.2 GB
Using seed 8002
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/home/lab/Github/multiworld-master/multiworld/envs/mujoco/mujoco_env.py", line 152 in initialize_camera
File "/home/lab/Github/multiworld-master/multiworld/core/image_env.py", line 75 in __init__
File "/home/lab/Github/multiworld-master/multiworld/envs/mujoco/__init__.py", line 466 in create_image_48_sawyer_door_pull_hook_v0
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 86 in make
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 125 in make
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 183 in make
File "/home/lab/Github/reward-learning-rl/softlearning/environments/utils.py", line 48 in get_goal_example_environment_from_variant
File "/home/lab/Github/reward-learning-rl/examples/classifier_rl/main.py", line 32 in _build
File "/home/lab/Github/reward-learning-rl/examples/development/main.py", line 77 in _train
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/trainable.py", line 151 in train
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/actor.py", line 479 in _actor_method_call
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/actor.py", line 138 in _remote
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/actor.py", line 124 in remote
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 111 in _train
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 143 in _start_trial
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 201 in start_trial
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 271 in step
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/tune.py", line 235 in run
File "/home/lab/Github/reward-learning-rl/examples/instrument.py", line 237 in run_example_local
File "/home/lab/Github/reward-learning-rl/examples/instrument.py", line 264 in run_example_debug
File "/home/lab/Github/reward-learning-rl/softlearning/scripts/console_scripts.py", line 81 in run_example_debug_cmd
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 555 in invoke
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 956 in invoke
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 1137 in invoke
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 717 in main
File "/home/lab/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 764 in __call__
File "/home/lab/Github/reward-learning-rl/softlearning/scripts/console_scripts.py", line 202 in main
File "/home/lab/anaconda3/envs/softlearning/bin/softlearning", line 11 in <module>
Segmentation fault (core dumped)
Is this useful to you?
I'm pretty sure it doesn't come from Tune itself, but it looks like it does because the code is run through Tune and thus the errors get propagated through it.
Could you try stepping into the main function using pdb and checking where it fails? It's very likely that it fails in RLAlgorithm.train.
@Skylark0924 Hello! I have the same error (Fatal Python error: Segmentation fault). Did you solve it? Could you give me some advice? Thank you very much!
@hit618 Sorry, I couldn't fix it in the end. It might be a problem caused by the firewall and an incomplete install of the environment.
Excuse me, I am confused about the above problem when I run the code:
The whole error log is as follows:
Actually, I have used all the requirements you mentioned, and the version of dm_control is correct. One thing I need to mention, though, is that https://github.com/deepmind/dm_control.git@0260f3effcfe2b0fdb25d9652dc27ba34b52d226 needs mujoco200 in its setup.py. So I wonder how you use this version and mujoco150 at the same time. Thanks a lot!
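To see which MuJoCo versions are actually present on a machine, one option is to list the version directories under the MuJoCo root. This is a rough sketch, not something taken from dm_control or mujoco-py: find_mujoco_installs is a hypothetical helper, and it assumes the common convention of unpacking releases under ~/.mujoco in directories whose names start with mjpro (for example mjpro150) or mujoco (for example mujoco200).

```python
import os


def find_mujoco_installs(root=None):
    """List MuJoCo-looking directories under `root` (default: ~/.mujoco).

    The name prefixes are an assumption about the usual install layout.
    """
    if root is None:
        root = os.path.join(os.path.expanduser("~"), ".mujoco")
    if not os.path.isdir(root):
        return []
    return sorted(
        name
        for name in os.listdir(root)
        if name.startswith(("mjpro", "mujoco"))
        and os.path.isdir(os.path.join(root, name))
    )


if __name__ == "__main__":
    print(find_mujoco_installs())
```

If both a 1.50 and a 2.00 directory show up, the two versions are installed side by side, and each library would then need to be pointed at the one it expects.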