clvrai / skill-chaining

Adversarial Skill Chaining for Long-Horizon Robot Manipulation via Terminal State Regularization (CoRL 2021)
https://clvrai.com/skill-chaining

Setting up the repository #1

Closed — feup-jmc closed this issue 2 years ago

feup-jmc commented 2 years ago

Hello, I'm starting this issue to continue from https://github.com/clvrai/furniture/issues/32, which has been closed. I've done as advised and switched to this repo, which also works out since I'll need to use T-STAR. I am somewhat confused as to whether this repo or https://github.com/clvrai/furniture/tree/tstar is the more appropriate one, as they both seem to do the same thing, but as per your indication I'll stick with this one for now.

So far, a few problems I have run into: step 1 (demo generation) only seems to use one core and does not accept --gpu 0, making it quite slow. mpirun also doesn't quite work, as the demos all overwrite each other.

A bigger issue, however, is step 2. I am not very familiar with wandb, but I made an account and logged in nevertheless. Regardless, the script crashes due to some wandb interaction. I also cannot test the later steps, as they seem to depend on the previous ones.

$ python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --gpu 0

pybullet build time: Mar 12 2022 19:43:28
[2022-04-01 22:43:14,427] Run a base worker.
[2022-04-01 22:43:14,428] Create log directory: log_refactor/table_lack_0825.gail.p0.123
[2022-04-01 22:43:14,428] Create video directory: log_refactor/table_lack_0825.gail.p0.123/video
[2022-04-01 22:43:14,428] Create demo directory: log_refactor/table_lack_0825.gail.p0.123/demo
[2022-04-01 22:43:14,446] Store parameters in log_refactor/table_lack_0825.gail.p0.123/params.json
wandb: Currently logged in as: khalid-rohith-team (use `wandb login --relogin` to force relogin)
wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)
Thread SenderThread:
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/lib/retry.py", line 102, in __call__
    result = self._call_fn(*args, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/internal_api.py", line 146, in execute
    six.reraise(*sys.exc_info())
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/six.py", line 719, in reraise
    raise value
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/internal_api.py", line 140, in execute
    return self.client.execute(*args, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/transport/requests.py", line 39, in execute
    request.raise_for_status()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api.wandb.ai/graphql

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/apis/normalize.py", line 24, in wrapper
    return func(*args, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/internal_api.py", line 1296, in upsert_run
    response = self.gql(mutation, variable_values=variable_values, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/lib/retry.py", line 118, in __call__
    if not check_retry_fn(e):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/util.py", line 872, in no_retry_auth
    raise CommError("Permission denied, ask the project owner to grant you access")
wandb.errors.CommError: Permission denied, ask the project owner to grant you access

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 54, in run
    self._run()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/internal_util.py", line 105, in _run
    self._process(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/internal.py", line 312, in _process
    self._sm.send(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/sender.py", line 237, in send
    send_handler(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/sender.py", line 695, in send_run
    self._init_run(run, config_value_dict)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/sender.py", line 733, in _init_run
    commit=run.git.last_commit or None,
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/apis/normalize.py", line 62, in wrapper
    six.reraise(CommError, CommError(message, err), sys.exc_info()[2])
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/six.py", line 718, in reraise
    raise value.with_traceback(tb)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/apis/normalize.py", line 24, in wrapper
    return func(*args, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/internal/internal_api.py", line 1296, in upsert_run
    response = self.gql(mutation, variable_values=variable_values, **kwargs)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/lib/retry.py", line 118, in __call__
    if not check_retry_fn(e):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/util.py", line 872, in no_retry_auth
    raise CommError("Permission denied, ask the project owner to grant you access")
wandb.errors.CommError: Permission denied, ask the project owner to grant you access
wandb: ERROR Internal wandb error: file data was not synced
Problem at: ~/TESTDIR/skill-chaining/method/robot_learning/main.py 143 _make_log_files
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 954, in init
    run = wi.init()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 614, in init
    backend.cleanup()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 248, in cleanup
    self.interface.join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 467, in join
    super().join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 630, in join
    _ = self._communicate_shutdown()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 464, in _communicate_shutdown
    _ = self._communicate(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 954, in init
    run = wi.init()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 614, in init
    backend.cleanup()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 248, in cleanup
    self.interface.join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 467, in join
    super().join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 630, in join
    _ = self._communicate_shutdown()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 464, in _communicate_shutdown
    _ = self._communicate(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "~/TESTDIR/skill-chaining/run.py", line 43, in <module>
    SkillChainingRun(parser).run()
  File "~/TESTDIR/skill-chaining/run.py", line 10, in __init__
    super().__init__(parser)
  File "~/TESTDIR/skill-chaining/method/robot_learning/main.py", line 51, in __init__
    self._make_log_files()
  File "~/TESTDIR/skill-chaining/method/robot_learning/main.py", line 143, in _make_log_files
    notes=config.notes,
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 992, in init
    six.raise_from(Exception("problem"), error_seen)
  File "<string>", line 3, in raise_from
Exception: problem
youngwoon commented 2 years ago

Hi,

  1. Unfortunately, the demo generation code does not support multi-processing. Instead, you can run multiple jobs by explicitly specifying the starting index of the demo files, `--start_count`, and the number of demos to generate, `--n_demos`. For example, if you want to use two processes to generate 200 demos, you can run one job with `--start_count 0 --n_demos 100` and another job with `--start_count 100 --n_demos 100` (a concrete sketch is at the end of this comment).

I have never used a GPU for demo generation. It probably gives you only marginal speedup.

  2. I'm not sure about the exact error described here. If you want to quickly check whether it works or not, you can disable wandb by specifying `--wandb False`.

The error says `wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)`, which may mean you need to create a project under your wandb team or personal account and then specify the project name with the argument `--wandb_project [PROJECT NAME]`.
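For example, generating 200 demos with two parallel jobs could look roughly like this (a sketch: the placeholder stands for whatever demo-generation command you already use in step 1; only `--start_count` and `--n_demos` differ between the two jobs):

$ <step-1 demo generation command> --start_count 0 --n_demos 100
$ <step-1 demo generation command> --start_count 100 --n_demos 100

Each job then writes demo files with non-overlapping indices, so they do not overwrite each other.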

feup-jmc commented 2 years ago

As always, thanks for the fast feedback. I hope I'm not bothering you too much with these frequent replies, but I really must get this to work.

I've managed (I think) to solve the multi-core issue with a bit of scripting, although it remains unclear whether I should run the full-task demo generation or the subtask demo generation (do both work? do I need subtask demos for the next steps?).

As for the wandb issue, `--wandb False` does solve the existing problem, but more problems remain. This is a recurrent issue I had with other branches, where gym complains about a missing private attribute:

$ python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --wandb False --gpu 0

pybullet build time: Mar 12 2022 19:43:28
[2022-04-03 21:32:17,583] Run a base worker.
[2022-04-03 21:32:17,583] Create log directory: log_refactor/table_lack_0825.gail.p0.123
[2022-04-03 21:32:17,583] Create video directory: log_refactor/table_lack_0825.gail.p0.123/video
[2022-04-03 21:32:17,583] Create demo directory: log_refactor/table_lack_0825.gail.p0.123/demo
[2022-04-03 21:32:17,602] Store parameters in log_refactor/table_lack_0825.gail.p0.123/params.json
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id table_lack_0825.gail.p0.123.
wandb: Tracking run with wandb version 0.12.11
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/gym/core.py:173: DeprecationWarning: WARN: Function `env.seed(seed)` is marked as deprecated and will be removed in the future. Please use `env.reset(seed=seed) instead.
  "Function `env.seed(seed)` is marked as deprecated and will be removed in the future. "
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "~/TESTDIR/skill-chaining/run.py", line 43, in <module>
    SkillChainingRun(parser).run()
  File "~/TESTDIR/skill-chaining/method/robot_learning/main.py", line 153, in run
    trainer = self._get_trainer()
  File "~/TESTDIR/skill-chaining/run.py", line 26, in _get_trainer
    return super()._get_trainer()
  File "~/TESTDIR/skill-chaining/method/robot_learning/main.py", line 149, in _get_trainer
    return Trainer(self._config)
  File "~/TESTDIR/skill-chaining/method/robot_learning/trainer.py", line 41, in __init__
    self._env = make_env(config.env, config)
  File "~/TESTDIR/skill-chaining/method/robot_learning/environments/__init__.py", line 24, in make_env
    return get_gym_env(name, config)
  File "~/TESTDIR/skill-chaining/method/robot_learning/environments/__init__.py", line 59, in get_gym_env
    return_state=(config.encoder_type == "cnn" and config.asym_ac),
  File "~/TESTDIR/skill-chaining/method/robot_learning/utils/gym_env.py", line 93, in __init__
    self.max_episode_steps = self.env._max_episode_steps // frame_skip
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/gym/core.py", line 228, in __getattr__
    raise AttributeError(f"attempted to get missing private attribute '{name}'")
AttributeError: attempted to get missing private attribute '_max_episode_steps'

wandb: Waiting for W&B process to finish... (failed 1).
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync log_refactor/table_lack_0825.gail.p0.123/wandb/offline-run-20220403_213217-table_lack_0825.gail.p0.123
wandb: Find logs at: log_refactor/table_lack_0825.gail.p0.123/wandb/offline-run-20220403_213217-table_lack_0825.gail.p0.123/logs
feup-jmc commented 2 years ago

Update: Since, as the error says, I can't find any private (or even public) attribute `_max_episode_steps` in gym/core.py (where `gym.Wrapper` is defined), I have chosen to replace it with a fixed value for now (using 200, as in the sample commands).
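For reference, the hard-coded value replaces line 93 of method/robot_learning/utils/gym_env.py. A slightly more defensive variant of that local workaround (just a sketch of what I did, not an official fix; the 200 fallback is the arbitrary value from the sample commands) would be:

# method/robot_learning/utils/gym_env.py, around line 93
# Prefer the TimeLimit wrapper attribute, then the registered spec,
# and only then fall back to a hard-coded episode length.
max_steps = getattr(self.env, "_max_episode_steps", None)
if max_steps is None and getattr(self.env, "spec", None) is not None:
    max_steps = self.env.spec.max_episode_steps
if max_steps is None:
    max_steps = 200  # arbitrary fallback, matching the sample commands
self.max_episode_steps = max_steps // frame_skip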

Step 2 now runs properly, but it doesn't seem to work when I try to integrate it with mpirun: there is a brief spike of activity, but although the program keeps running, nothing happens. I can't find a way to debug this, as it only seems to manifest with mpirun.


GPU integration for training is also still in the troubleshooting phase, so I could really use mpirun working.

youngwoon commented 2 years ago

I'm sorry to hear that you've encountered problems running our code. I found that all of these problems come from recent updates to the gym and wandb packages. I fixed those issues and pushed the changes; the code works fine for me now. Please try again with the updated code and submodules.

feup-jmc commented 2 years ago

Thank you! The changes seem to have solved the problem, although, despite having 8 threads, the program crashes if I try to use more than 4 (a large spike at the start crashes the process, even though the steady-state usage probably doesn't need all the system resources).

One question going forward: are the 'success' and 'ckpt' files necessary for training the other subtasks and the terminal states? How are they generated? Does the training have to conclude, and can't it be made shorter? Since it is expected to take 50 h (3-4x faster than before) for a single subtask (100M training steps), it would be helpful if I could make sure everything works with shorter training and save the full training for a later stage.

feup-jmc commented 2 years ago

Update: I figured out that you can manually change the number of training steps with the `--max_global_step` switch, and I now understand that the ckpt files are checkpoints. However, the success-file problem remains: even after running the training step (with a smaller number of steps), completing the training doesn't generate the "success" files, making it impossible to fully test the software suite. I understand the following code is responsible, but I can't work out what the issue is. Hoping for help when possible. https://github.com/clvrai/skill-chaining/blob/02670e4b5d6b669d8b5393bb594675b1cdf48ec9/policy_sequencing_trainer.py#L175-L185
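For reference, a shortened test run can be built from the earlier training command plus `--max_global_step`; the step count below is just an arbitrary small value for a quick check, not a recommended setting:

$ python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --wandb False --gpu 0 --max_global_step 1000000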