MohitShridhar / genima

Official Code Repo for GENIMA
https://genima-robot.github.io/
Apache License 2.0

In 'controller': ConfigTypeError raised while composing config: #3

Closed albzni closed 3 months ago

albzni commented 3 months ago

I'm trying to train from scratch, but when I get to step 4, "Train an ACT controller to follow spheres", an error is reported:


In 'controller': ConfigTypeError raised while composing config.
Cannot merge DictConfig with ListConfig
    full_key. 
    object_type=dict

I'm guessing this error is caused by the definition of env.tasks. But no matter how I define it (env.tasks=[take_lid_off_saucepan], env.tasks=['take_lid_off_saucepan'], or env.tasks="[take_lid_off_saucepan]"), I get the same error.

Do you have any suggestions on how to fix this? Thanks!

MohitShridhar commented 3 months ago

@albzni can you post the full command you are trying to run?

albzni commented 3 months ago

The full command is:

python train_act.py \
     env=rlbench \
     env.dataset_root=/data3/share/genima_data/train_data_rnd_bg/ \
     work_dir=/data3/czx/genima/controller \
     demos=25 \
     env.tasks=[take_lid_off_saucepan] \
     num_train_epochs=1000 \
     action_sequence=20 \
     batch_size=8 \
     wandb.use=true

richielo commented 3 months ago

Can you run with HYDRA_FULL_ERROR=1 so we can see the full stack trace?

albzni commented 3 months ago

Sure! The full error stack message:

Traceback (most recent call last):
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 542, in _compose_config_from_defaults_list
    cfg.merge_with(loaded.config)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 492, in merge_with
    self._format_and_raise(key=None, value=None, cause=e)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 490, in merge_with
    self._merge_with(*others)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
    dest_node._merge_with(src_node)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
    dest_node._merge_with(src_node)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 518, in _merge_with
    raise TypeError("Cannot merge DictConfig with ListConfig")
omegaconf.errors.ConfigTypeError: Cannot merge DictConfig with ListConfig
    full_key: 
    object_type=dict

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data3/czx/genima/controller/train_act.py", line 296, in <module>
    main()
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 105, in run
    cfg = self.compose_config(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 594, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 142, in load_configuration
    return self._load_configuration_impl(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 263, in _load_configuration_impl
    cfg = self._compose_config_from_defaults_list(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 544, in _compose_config_from_defaults_list
    raise ConfigCompositionException(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 542, in _compose_config_from_defaults_list
    cfg.merge_with(loaded.config)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 492, in merge_with
    self._format_and_raise(key=None, value=None, cause=e)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 490, in merge_with
    self._merge_with(*others)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
    dest_node._merge_with(src_node)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
    dest_node._merge_with(src_node)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 518, in _merge_with
    raise TypeError("Cannot merge DictConfig with ListConfig")
hydra.errors.ConfigCompositionException: In 'controller': ConfigTypeError raised while composing config:
Cannot merge DictConfig with ListConfig
    full_key: 
    object_type=dict

richielo commented 3 months ago

~Does it work if you modify it in controller.yaml instead of overriding it in the command? Sorry I have no Linux means to debug this at the moment. Appreciate the patience~ I think we have found a fix, will push very shortly now

albzni commented 3 months ago

Great! Thank you for your patience in replying 😄

richielo commented 3 months ago

Can you pull and try again? Replace .tasks with .train_tasks. There is a config name conflict with the underlying robobase hydra config files
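For readers who hit the same message: this kind of error shows up when two config sources define the same key with different container types. The sketch below is a toy recursive merge in plain Python, not the actual OmegaConf or Hydra implementation. The function `toy_merge` and the contents placed under `env.tasks` in the base config are hypothetical, chosen only to reproduce the dict-versus-list clash and to show why renaming the key avoids it.

```python
def toy_merge(dest, src, path=""):
    """Recursively merge src into dest; refuse to merge a dict with a list."""
    for key, src_value in src.items():
        full_key = f"{path}.{key}" if path else key
        dest_value = dest.get(key)
        both_containers = isinstance(dest_value, (dict, list)) and isinstance(
            src_value, (dict, list)
        )
        if both_containers and type(dest_value) is not type(src_value):
            # Same key, different container types: nothing sensible to merge.
            raise TypeError(f"Cannot merge dict with list at '{full_key}'")
        if isinstance(dest_value, dict):
            toy_merge(dest_value, src_value, full_key)
        else:
            dest[key] = src_value
    return dest


# Hypothetical base config in which `env.tasks` is already a mapping.
base = {"env": {"tasks": {"some_task": {"episode_length": 100}}}}

try:
    # The override supplies a list under the same key, so the merge fails.
    toy_merge(base, {"env": {"tasks": ["take_lid_off_saucepan"]}})
except TypeError as err:
    print(err)

# A differently named key no longer collides with the existing mapping.
merged = toy_merge(base, {"env": {"train_tasks": ["take_lid_off_saucepan"]}})
print(merged["env"]["train_tasks"])
```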

albzni commented 3 months ago

Thank you! This one worked! But once it was running, I ran into another problem:

[2024-07-17 12:21:41,627][root][WARNING] - Multicam fusion is enabled but view_fusion_model is not set!
[2024-07-17 12:21:42,158][root][INFO] -          saving to disk: /tmp/tmp0yh4hpzh
[2024-07-17 12:21:42,160][root][INFO] - Creating a EpochReplayBuffer replay memory with the following parameters:
[2024-07-17 12:21:42,160][root][INFO] -          frame_stack: 1
[2024-07-17 12:21:42,161][root][INFO] -          replay_capacity: 1000000
[2024-07-17 12:21:42,161][root][INFO] -          batch_size: 8
[2024-07-17 12:21:42,161][root][INFO] -          nstep: 3
[2024-07-17 12:21:42,161][root][INFO] -          gamma: 0.990000
Action mean: [-0.1170458   0.01044554  0.12544061 -2.20029259 -0.04368392  2.01885772
  1.71965766  0.5       ]
Action std: [0.65933877 0.45918962 0.68142641 0.44278744 0.58593202 0.48843095
 0.78792626 0.16666667]
Saved to {self.action_stats_path}/action_stats.json
Proprio mean: [ 0.5        -0.1170458   0.01044554  0.12544061 -2.20029259 -0.04368392
  2.01885772  1.71965766]
Proprio std: [0.16666667 0.65933877 0.45918962 0.68142641 0.44278744 0.58593202
 0.48843095 0.78792626]
Saved to {self.proprio_stats_path}/proprio_stats.json
Error executing job with overrides: ['env=rlbench', 'env.dataset_root=/data3/share/genima_data/rendered/train_data_rnd_bg', 'work_dir=/data3/czx/genima/controller', 'demos=25', 'env.train_tasks=[take_shoes_out_of_box]', 'num_train_epochs=1000', 'action_sequence=20', 'batch_size=8', 'wandb.use=false']
Traceback (most recent call last):
  File "/data3/czx/genima/controller/train_act.py", line 292, in main
    workspace.train()
  File "/data3/czx/genima/controller/train_act.py", line 259, in train
    self._load_demos()
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/workspace.py", line 489, in _load_demos
    self.env_factory.load_demos_into_replay(self.cfg, self.replay_buffer)
  File "/data3/czx/genima/controller/env/rlbench.py", line 351, in load_demos_into_replay
    add_demo_to_replay_buffer(demo_env, buffer)
  File "/data3/czx/genima/controller/env/rlbench_utils.py", line 248, in add_demo_to_replay_buffer
    replay_buffer.add(act, rew, term, trunc, **obs_and_info)
TypeError: UniformReplayBuffer.add() missing 1 required positional argument: 'truncated'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

(I've collected data for the "take_shoes_out_of_box" task, which is not in your paper, to see how well GENIMA works on it. ^^)

The RLBench environment is able to render it successfully and my run command is:

python train_act.py \
     env=rlbench \
     env.dataset_root=/data3/share/genima_data/rendered/train_data_rnd_bg \
     work_dir=/data3/czx/genima/controller \
     demos=25 \
     env.train_tasks=[take_shoes_out_of_box] \
     num_train_epochs=1000 \
     action_sequence=20 \
     batch_size=8 \
     wandb.use=false

I tried to fix this bug myself, but because I'm not familiar with the overall code structure, changing one place causes more bugs to appear, so I'm asking for your help again. Thank you for your patience! 🙏

MohitShridhar commented 3 months ago

@albzni, can you replace that line in rlbench_utils.py with replay_buffer.add(obs, act, rew, term, trunc, **obs_and_info)?

And thanks for your patience! We unfortunately lost access to all our resources and compute before we could properly test things 😢. But I think you are getting close. After fixing train_act.py, everything should work 🤞
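The TypeError in the log above is what Python reports when a call passes one positional argument too few. A minimal sketch, using a toy class rather than the real robobase buffer: `ToyReplayBuffer` and its parameter names are assumptions that mirror only the argument order implied by the error message and the suggested fix.

```python
class ToyReplayBuffer:
    """Hypothetical stand-in; only the argument order of add() is modelled."""

    def __init__(self):
        self.transitions = []

    def add(self, observation, action, reward, terminal, truncated, **kwargs):
        self.transitions.append(
            (observation, action, reward, terminal, truncated, kwargs)
        )


buffer = ToyReplayBuffer()
obs, act, rew, term, trunc = {"front_rgb": "image"}, [0.0], 0.0, False, False

try:
    # Observation omitted: every argument shifts left by one slot,
    # so the last positional parameter ("truncated") is left unfilled.
    buffer.add(act, rew, term, trunc)
except TypeError as err:
    print(err)

# Passing the observation first fills all five positional parameters.
buffer.add(obs, act, rew, term, trunc)
print(len(buffer.transitions))
```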

albzni commented 3 months ago

Thank you for your reply! This did solve the problem I asked about above. I still ran into a small error later on, but I found the cause and fixed it myself based on your suggestion (^_^)v.

Now I can proceed with training the model without any problems. Thanks again for your help! 🙏

MohitShridhar commented 3 months ago

@albzni, that's great!

Can you tell us what the other problem was? So we can fix it if anything is wrong. Thanks!

albzni commented 3 months ago

@MohitShridhar Sure! I encountered the following 3 main errors:

1. Traceback (most recent call last):
  File "/data3/czx/genima/controller/train_act.py", line 292, in main
    workspace.train()
  File "/data3/czx/genima/controller/train_act.py", line 259, in train
    self._load_demos()
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/workspace.py", line 489, in _load_demos
    self.env_factory.load_demos_into_replay(self.cfg, self.replay_buffer)
  File "/data3/czx/genima/controller/env/rlbench.py", line 351, in load_demos_into_replay
    add_demo_to_replay_buffer(demo_env, buffer)
  File "/data3/czx/genima/controller/env/rlbench_utils.py", line 247, in add_demo_to_replay_buffer
    replay_buffer.add(obs, act, rew, term, trunc, **obs_and_info)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 416, in add
    self._check_add_types(transition, self._storage_signature)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 533, in _check_add_types
    raise ValueError(
ValueError: arg front_rgb has shape (1, 3, 256, 256), expected (3, 256, 256)

I added

for key, value in obs.items():
    if isinstance(value, np.ndarray) and value.shape[0] == 1:
        obs[key] = value.squeeze(0)

before replay_buffer.add(obs, act, rew, term, trunc, **obs_and_info) to fix this error.

2. Traceback (most recent call last):
  File "/data3/czx/genima/controller/train_act.py", line 292, in main
    workspace.train()
  File "/data3/czx/genima/controller/train_act.py", line 259, in train
    self._load_demos()
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/workspace.py", line 489, in _load_demos
    self.env_factory.load_demos_into_replay(self.cfg, self.replay_buffer)
  File "/data3/czx/genima/controller/env/rlbench.py", line 351, in load_demos_into_replay
    add_demo_to_replay_buffer(demo_env, buffer)
  File "/data3/czx/genima/controller/env/rlbench_utils.py", line 262, in add_demo_to_replay_buffer
    replay_buffer.add_final(**final_obs)
TypeError: UniformReplayBuffer.add_final() got an unexpected keyword argument 'front_rgb'

I replaced replay_buffer.add_final(**final_obs) with replay_buffer.add_final(final_obs) to fix this error.

3. Unexpected error in training: operands could not be broadcast together with shapes (2,) (3,)
[2024-07-17 16:51:39,481][root][ERROR] - Traceback (most recent call last):
  File "/data3/czx/genima/controller/train_act.py", line 218, in _train
    self.agent.update(
  File "/data3/czx/genima/controller/method/genima_act.py", line 365, in update
    batch = next(replay_iter)
  File "/data3/czx/genima/controller/utils/dataloader.py", line 93, in next
    return self.sample(batch_size=len(batch_indices), indices=batch_indices)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 843, in sample
    samples = [self.sample_single(indices[i]) for i in range(batch_size)]
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 843, in <listcomp>
    samples = [self.sample_single(indices[i]) for i in range(batch_size)]
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 829, in sample_single
    return self._sample_non_sequential(global_index)
  File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 794, in _sample_non_sequential
    episode[REWARD][idx]
ValueError: operands could not be broadcast together with shapes (2,) (3,)

I changed the relevant lines in uniform_replay_buffer.py to:

    next_idx = min(next_idx, len(episode[TERMINAL]))
    discount_slice_len = next_idx - idx

    ###2024.7.17
    reward_slice = episode[REWARD][idx:next_idx]
    discount_slice = self._cumulative_discount_vector[:discount_slice_len]

    # Adjust the shape if necessary
    if reward_slice.shape != discount_slice.shape:
        min_len = min(reward_slice.shape[0], discount_slice.shape[0])
        reward_slice = reward_slice[:min_len]
        discount_slice = discount_slice[:min_len]
    ###

    replay_sample.update(
        {
            REWARD: np.sum(reward_slice * discount_slice),
            TERMINAL: episode[TERMINAL][next_idx - 1],
            TRUNCATED: episode[TRUNCATED][next_idx - 1],
            INDICES: global_index,
            DISCOUNT: self._gamma**discount_slice_len,  # effective discount
        }
    )

Notably, the above solutions may not always be right, but I'm now able to train successfully. 😄
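The add_final error above comes from unpacking the observation dict into keyword arguments when the method expects the dict itself. A minimal sketch with a toy class, not the real robobase buffer: `ToyFinalBuffer`, its parameter name, and the `low_dim_state` key are hypothetical and used only for illustration.

```python
class ToyFinalBuffer:
    """Hypothetical stand-in whose add_final() takes one dict argument."""

    def __init__(self):
        self.final_observation = None

    def add_final(self, final_observation):
        self.final_observation = dict(final_observation)


final_obs = {"front_rgb": "last image", "low_dim_state": [0.0]}
buffer = ToyFinalBuffer()

try:
    # ** turns every dict key into a keyword argument (front_rgb=..., ...),
    # none of which matches the single parameter the method declares.
    buffer.add_final(**final_obs)
except TypeError as err:
    print(err)

# Passing the dict as one positional argument matches the signature.
buffer.add_final(final_obs)
print(sorted(buffer.final_observation))
```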

albzni commented 3 months ago

By the way, if you find any problems with my fixes above after checking them (e.g. a negative effect on the training results), please let me know. Thank you very much! 🙏🙏

richielo commented 3 months ago

@albzni Thanks for the detailed information. I think we will push the first two as fixes. For the third one, if you remove the lines you added (i.e., keep the replay buffer as is) and add the following to controller.yaml:

replay:
    nstep: 1

Does that solve the issue?
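Some background on why nstep matters here: the training log above reports nstep: 3 and gamma: 0.99. With nstep greater than 1, a sample taken near the end of an episode has fewer rewards left than discount factors, which is one way a (2,) versus (3,) shape mismatch can arise. The sketch below is plain Python, not robobase code; `n_step_return` is a hypothetical helper that assumes the reward window is clipped at the episode end and the discounts are cut to the same length.

```python
def n_step_return(rewards, idx, nstep, gamma):
    """Discounted sum of up to `nstep` rewards starting at `idx`.

    The window is clipped at the end of the episode, and the discount
    factors are generated for exactly that many steps, so both always
    have the same length.
    """
    end = min(idx + nstep, len(rewards))
    window = rewards[idx:end]
    discounts = [gamma**k for k in range(len(window))]
    total = sum(r * d for r, d in zip(window, discounts))
    effective_discount = gamma ** len(window)
    return total, effective_discount


rewards = [0.0, 0.0, 0.0, 1.0]  # toy episode with a reward on the last step

# Full three-step window.
print(n_step_return(rewards, idx=0, nstep=3, gamma=0.99))

# Only two rewards remain, so the window shrinks to two steps.
print(n_step_return(rewards, idx=2, nstep=3, gamma=0.99))

# With nstep=1 the window is always a single reward, so no clipping is needed.
print(n_step_return(rewards, idx=3, nstep=1, gamma=0.99))
```

With nstep set to 1, each sample uses exactly one reward and one discount factor, so the lengths cannot disagree.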

albzni commented 3 months ago

It works! Thank you so much 😄

richielo commented 3 months ago

Great! Thanks for debugging with us!