Closed feup-jmc closed 2 years ago
Hi,

You can parallelize demo generation by specifying the index of the first demo, `--start_count`, and the number of demos to generate, `--n_demos`. For example, if you want to use two processes to generate 200 demos, you can run one job with `--start_count 0 --n_demos 100` and another job with `--start_count 100 --n_demos 100`. I have never used a GPU for demo generation. It probably gives you only marginal speedup.

You can disable wandb logging with `--wandb False`. The error said `wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)`, which may mean you need to make a project under your wandb team or personal account, and then specify the project name using the argument `--wandb_project [PROJECT NAME]`.
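If you want to script the two jobs rather than launch them by hand, a small launcher along these lines works. The entry-point module here is a placeholder (an assumption, not the repo's real module name); substitute the actual demo-generation command from the README:

```python
import subprocess

def demo_splits(total_demos, n_jobs):
    """Split total_demos into contiguous (start_count, n_demos) chunks."""
    per_job, rem = divmod(total_demos, n_jobs)
    splits, start = [], 0
    for i in range(n_jobs):
        n = per_job + (1 if i < rem else 0)  # spread any remainder over early jobs
        splits.append((start, n))
        start += n
    return splits

if __name__ == "__main__":
    # Placeholder command -- substitute the repo's actual demo script.
    cmd = ["python", "-m", "demo_gen"]  # assumption, not the real module name
    procs = [
        subprocess.Popen(cmd + ["--start_count", str(s), "--n_demos", str(n)])
        for s, n in demo_splits(200, 2)
    ]
    for p in procs:
        p.wait()
```

Because each job writes to a disjoint index range, the output files never collide.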
As always, thanks for the fast feedback. I hope not to be bothering you too much with these frequent replies but I really must get this to work.
I've managed (I think) to solve the multi-core issue with a bit of scripting, although it remains a bit unclear as to whether I should run the full task demo generation or the subtask demo generation (do both work? do I need subtask demos for the next steps?).
As for the wandb issue, `--wandb False` does solve the existing problem, but more problems remain - this is a recurrent issue I had with other branches, where gym complains about a missing private attribute:
```
$ python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --wandb False --gpu 0
pybullet build time: Mar 12 2022 19:43:28
[2022-04-03 21:32:17,583] Run a base worker.
[2022-04-03 21:32:17,583] Create log directory: log_refactor/table_lack_0825.gail.p0.123
[2022-04-03 21:32:17,583] Create video directory: log_refactor/table_lack_0825.gail.p0.123/video
[2022-04-03 21:32:17,583] Create demo directory: log_refactor/table_lack_0825.gail.p0.123/demo
[2022-04-03 21:32:17,602] Store parameters in log_refactor/table_lack_0825.gail.p0.123/params.json
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id table_lack_0825.gail.p0.123.
wandb: Tracking run with wandb version 0.12.11
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/gym/core.py:173: DeprecationWarning: WARN: Function `env.seed(seed)` is marked as deprecated and will be removed in the future. Please use `env.reset(seed=seed) instead.
  "Function `env.seed(seed)` is marked as deprecated and will be removed in the future. "
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "~/TESTDIR/skill-chaining/run.py", line 43, in <module>
    SkillChainingRun(parser).run()
  File "~/TESTDIR/skill-chaining/method/robot_learning/main.py", line 153, in run
    trainer = self._get_trainer()
  File "~/TESTDIR/skill-chaining/run.py", line 26, in _get_trainer
    return super()._get_trainer()
  File "~/TESTDIR/skill-chaining/method/robot_learning/main.py", line 149, in _get_trainer
    return Trainer(self._config)
  File "~/TESTDIR/skill-chaining/method/robot_learning/trainer.py", line 41, in __init__
    self._env = make_env(config.env, config)
  File "~/TESTDIR/skill-chaining/method/robot_learning/environments/__init__.py", line 24, in make_env
    return get_gym_env(name, config)
  File "~/TESTDIR/skill-chaining/method/robot_learning/environments/__init__.py", line 59, in get_gym_env
    return_state=(config.encoder_type == "cnn" and config.asym_ac),
  File "~/TESTDIR/skill-chaining/method/robot_learning/utils/gym_env.py", line 93, in __init__
    self.max_episode_steps = self.env._max_episode_steps // frame_skip
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/gym/core.py", line 228, in __getattr__
    raise AttributeError(f"attempted to get missing private attribute '{name}'")
AttributeError: attempted to get missing private attribute '_max_episode_steps'
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync log_refactor/table_lack_0825.gail.p0.123/wandb/offline-run-20220403_213217-table_lack_0825.gail.p0.123
wandb: Find logs at: log_refactor/table_lack_0825.gail.p0.123/wandb/offline-run-20220403_213217-table_lack_0825.gail.p0.123/logs
```
Update: since, as the error says, I can't find any private (or even public) attribute `_max_episode_steps` in `gym/core.py` (where `gym.Wrapper` is defined), I have chosen to replace it with a fixed value for now (using 200, as in the sample commands).
Step 2 now runs properly but doesn't seem to work when I try to integrate it with `mpirun`: there is a brief spike of activity, but despite the program still running, nothing happens. I can't find a way to debug this as it seems to only manifest with `mpirun`.

GPU integration for training remains in the troubleshooting phase as well, so I could really use `mpirun` working.
I'm sorry to hear that you encountered problems running our code. I found that all these problems come from recent updates of the `gym` and `wandb` packages. I fixed those issues and pushed the changes. The code works fine for me. Please try again with the updated code and submodules.
Thank you! The changes seem to have solved the problem, although despite having 8 threads, the program crashes if I try to use more than 4 (the large spike at the start kills the process, even though steady-state usage probably doesn't need all system resources).

One question going forward: are the 'success' and 'ckpt' files necessary for training the other subtasks and terminal states? How are they generated? Does the training have to run to completion, and can it be made shorter? Since it is expected to take 50h (3-4x faster than before) for a single subtask (100M steps), it would be helpful if I could make sure everything works with shorter training and save the full training for a later stage.
Update: I figured out that you can manually change the number of training steps with the `--max_global_step` switch. I also understood that the ckpt files are checkpoints. However, the success-file problem remains: even after running the training step (with a smaller number of steps), completing the training doesn't generate the "success" files, making it impossible to fully test the software suite.

I understand the following code is responsible but can't work out what the issue is. Hoping for help when possible.
https://github.com/clvrai/skill-chaining/blob/02670e4b5d6b669d8b5393bb594675b1cdf48ec9/policy_sequencing_trainer.py#L175-L185
Hello, starting this issue to continue from https://github.com/clvrai/furniture/issues/32, which has been closed. I've done as advised and switched to this repo, which also works out as I'll need to use T-STAR. I am somewhat confused as to whether this repo or https://github.com/clvrai/furniture/tree/tstar is the most appropriate, as they both seem to be trying to do the same thing, but as advised I'll stick to this one for now.

So far, a few problems I have run into are that step 1 (demo generation) only seems to use 1 core, as well as not accepting `--gpu 0`, making it quite slow. `mpirun` also doesn't quite work, as the demos all overwrite each other. A bigger issue, however, is step 2. I am not very familiar with wandb, but nevertheless made an account and logged in. Regardless, due to some wandb interaction the script crashes. I also cannot test further steps as they seem to depend on the previous ones.
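One way around the overwriting would be giving each `mpirun` process its own index range. A sketch, assuming Open MPI's environment variables (other MPI distributions use different names):

```python
import os

def rank_offset_args(total_demos, world_size, rank):
    """Give each MPI rank its own contiguous range of demo indices."""
    per_rank = total_demos // world_size
    return ["--start_count", str(rank * per_rank), "--n_demos", str(per_rank)]

# Open MPI exports these to every spawned process; with no MPI launcher,
# the defaults make this a single-process run.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))
extra_args = rank_offset_args(200, size, rank)
```

Appending `extra_args` to the demo-generation command inside each rank keeps the output files disjoint.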