Closed albzni closed 3 months ago
@albzni can you post the full command you are trying to run?
The full command is:
python train_act.py \
env=rlbench \
env.dataset_root=/data3/share/genima_data/train_data_rnd_bg/ \
work_dir=/data3/czx/genima/controller \
demos=25 \
env.tasks=[take_lid_off_saucepan] \
num_train_epochs=1000 \
action_sequence=20 \
batch_size=8 \
wandb.use=true
Can you run with HYDRA_FULL_ERROR=1 so we can see the full stack trace?
Sure! The full error stack message:
Traceback (most recent call last):
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 542, in _compose_config_from_defaults_list
cfg.merge_with(loaded.config)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 492, in merge_with
self._format_and_raise(key=None, value=None, cause=e)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
format_and_raise(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
_raise(ex, cause)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 490, in merge_with
self._merge_with(*others)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
BaseContainer._map_merge(self, other)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
dest_node._merge_with(src_node)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
BaseContainer._map_merge(self, other)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
dest_node._merge_with(src_node)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 518, in _merge_with
raise TypeError("Cannot merge DictConfig with ListConfig")
omegaconf.errors.ConfigTypeError: Cannot merge DictConfig with ListConfig
full_key:
object_type=dict
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data3/czx/genima/controller/train_act.py", line 296, in <module>
main()
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 105, in run
cfg = self.compose_config(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 594, in compose_config
cfg = self.config_loader.load_configuration(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 142, in load_configuration
return self._load_configuration_impl(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 263, in _load_configuration_impl
cfg = self._compose_config_from_defaults_list(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 544, in _compose_config_from_defaults_list
raise ConfigCompositionException(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py", line 542, in _compose_config_from_defaults_list
cfg.merge_with(loaded.config)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 492, in merge_with
self._format_and_raise(key=None, value=None, cause=e)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
format_and_raise(
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
_raise(ex, cause)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 490, in merge_with
self._merge_with(*others)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
BaseContainer._map_merge(self, other)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
dest_node._merge_with(src_node)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
BaseContainer._map_merge(self, other)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 399, in _map_merge
dest_node._merge_with(src_node)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 518, in _merge_with
raise TypeError("Cannot merge DictConfig with ListConfig")
hydra.errors.ConfigCompositionException: In 'controller': ConfigTypeError raised while composing config:
Cannot merge DictConfig with ListConfig
full_key:
object_type=dict
~Does it work if you modify it in controller.yaml instead of overriding it in the command? Sorry I have no Linux means to debug this at the moment. Appreciate the patience~ I think we have found a fix, will push very shortly now
Great! Thank you for your patience in replying.
Can you pull and try again? Replace .tasks with .train_tasks. There is a config name conflict with the underlying robobase hydra config files.
Thank you! This one worked! But when I was able to get it running I ran into another problem:
[2024-07-17 12:21:41,627][root][WARNING] - Multicam fusion is enabled but view_fusion_model is not set!
[2024-07-17 12:21:42,158][root][INFO] - saving to disk: /tmp/tmp0yh4hpzh
[2024-07-17 12:21:42,160][root][INFO] - Creating a EpochReplayBuffer replay memory with the following parameters:
[2024-07-17 12:21:42,160][root][INFO] - frame_stack: 1
[2024-07-17 12:21:42,161][root][INFO] - replay_capacity: 1000000
[2024-07-17 12:21:42,161][root][INFO] - batch_size: 8
[2024-07-17 12:21:42,161][root][INFO] - nstep: 3
[2024-07-17 12:21:42,161][root][INFO] - gamma: 0.990000
Action mean: [-0.1170458 0.01044554 0.12544061 -2.20029259 -0.04368392 2.01885772
1.71965766 0.5 ]
Action std: [0.65933877 0.45918962 0.68142641 0.44278744 0.58593202 0.48843095
0.78792626 0.16666667]
Saved to {self.action_stats_path}/action_stats.json
Proprio mean: [ 0.5 -0.1170458 0.01044554 0.12544061 -2.20029259 -0.04368392
2.01885772 1.71965766]
Proprio std: [0.16666667 0.65933877 0.45918962 0.68142641 0.44278744 0.58593202
0.48843095 0.78792626]
Saved to {self.proprio_stats_path}/proprio_stats.json
Error executing job with overrides: ['env=rlbench', 'env.dataset_root=/data3/share/genima_data/rendered/train_data_rnd_bg', 'work_dir=/data3/czx/genima/controller', 'demos=25', 'env.train_tasks=[take_shoes_out_of_box]', 'num_train_epochs=1000', 'action_sequence=20', 'batch_size=8', 'wandb.use=false']
Traceback (most recent call last):
File "/data3/czx/genima/controller/train_act.py", line 292, in main
workspace.train()
File "/data3/czx/genima/controller/train_act.py", line 259, in train
self._load_demos()
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/workspace.py", line 489, in _load_demos
self.env_factory.load_demos_into_replay(self.cfg, self.replay_buffer)
File "/data3/czx/genima/controller/env/rlbench.py", line 351, in load_demos_into_replay
add_demo_to_replay_buffer(demo_env, buffer)
File "/data3/czx/genima/controller/env/rlbench_utils.py", line 248, in add_demo_to_replay_buffer
replay_buffer.add(act, rew, term, trunc, **obs_and_info)
TypeError: UniformReplayBuffer.add() missing 1 required positional argument: 'truncated'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(I've collected data on the "take_shoes_out_of_box" task, which is not in your paper, to see how well genima works on it.)
The RLBench environment is able to render it successfully and my run command is:
python train_act.py \
env=rlbench \
env.dataset_root=/data3/share/genima_data/rendered/train_data_rnd_bg \
work_dir=/data3/czx/genima/controller \
demos=25 \
env.train_tasks=[take_shoes_out_of_box] \
num_train_epochs=1000 \
action_sequence=20 \
batch_size=8 \
wandb.use=false
I tried to fix this bug myself, but because I am not familiar with the overall code structure, changing one place introduces more bugs, so I am asking for your help again. Thank you for your patience!
@albzni, can you replace that line in rlbench_utils.py with the following?
replay_buffer.add(obs, act, rew, term, trunc, **obs_and_info)
And thanks for your patience! We unfortunately lost access to all our resources and compute before we could properly test things. But I think you are getting close. After fixing train_act.py, everything should work.
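To illustrate why the original call fails with "missing 1 required positional argument: 'truncated'", here is a minimal sketch. StubReplayBuffer is a hypothetical stand-in, not the robobase class; its argument order is an assumption inferred from the corrected call above.

```python
class StubReplayBuffer:
    """Hypothetical stand-in for the real buffer; only the argument order matters here."""

    def __init__(self):
        self.transitions = []

    def add(self, observation, action, reward, terminal, truncated, **kwargs):
        # Store the transition exactly as received.
        self.transitions.append(
            (observation, action, reward, terminal, truncated, kwargs)
        )


buffer = StubReplayBuffer()
obs = {"front_rgb": "image placeholder"}
info = {"demo": 1}
act, rew, term, trunc = [0.0] * 8, 0.0, False, False

# Old call: the observation dict is missing, so every positional argument
# shifts one slot to the left and nothing is left to fill `truncated`.
try:
    buffer.add(act, rew, term, trunc, **info)
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Corrected call: the observation comes first.
buffer.add(obs, act, rew, term, trunc, **info)
print(len(buffer.transitions))
```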
Thank you for your reply! This solution did solve the problem I asked about above (I still encountered a small error later on, but I found the cause and solved it myself based on your suggestion (^_^)v ).
Now I can proceed with training the model without any problems, thanks again for your help!
@albzni, that's great!
Can you tell us what the other problem was? So we can fix it if anything is wrong. Thanks!
@MohitShridhar Sure! I encountered the following 3 main errors:
Traceback (most recent call last):
File "/data3/czx/genima/controller/train_act.py", line 292, in main
workspace.train()
File "/data3/czx/genima/controller/train_act.py", line 259, in train
self._load_demos()
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/workspace.py", line 489, in _load_demos
self.env_factory.load_demos_into_replay(self.cfg, self.replay_buffer)
File "/data3/czx/genima/controller/env/rlbench.py", line 351, in load_demos_into_replay
add_demo_to_replay_buffer(demo_env, buffer)
File "/data3/czx/genima/controller/env/rlbench_utils.py", line 247, in add_demo_to_replay_buffer
replay_buffer.add(obs, act, rew, term, trunc, **obs_and_info)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 416, in add
self._check_add_types(transition, self._storage_signature)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 533, in _check_add_types
raise ValueError(
ValueError: arg front_rgb has shape (1, 3, 256, 256), expected (3, 256, 256)
I added
for key, value in obs.items():
    if isinstance(value, np.ndarray) and value.shape[0] == 1:
        obs[key] = value.squeeze(0)
before replay_buffer.add(obs, act, rew, term, trunc, **obs_and_info)
to fix this error.
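The squeeze workaround can be sketched in isolation as follows. The helper name drop_leading_singleton and the low_dim_state key are hypothetical. The extra ndim check is an added assumption, not part of the fix posted above: it keeps one-element vectors of shape (1,) from being collapsed to scalars.

```python
import numpy as np


def drop_leading_singleton(obs):
    """Return a copy of obs in which arrays shaped (1, ...) become (...)."""
    cleaned = {}
    for key, value in obs.items():
        is_array = isinstance(value, np.ndarray)
        # Only strip a leading axis of size 1 from arrays with 2+ dimensions.
        if is_array and value.ndim > 1 and value.shape[0] == 1:
            cleaned[key] = value.squeeze(0)
        else:
            cleaned[key] = value
    return cleaned


obs = {
    "front_rgb": np.zeros((1, 3, 256, 256), dtype=np.uint8),
    "low_dim_state": np.zeros((8,), dtype=np.float32),  # hypothetical key
}
cleaned = drop_leading_singleton(obs)
print({key: value.shape for key, value in cleaned.items()})
```

Returning a new dict instead of modifying obs in place leaves the original observation untouched, which can help if it is reused elsewhere.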
Traceback (most recent call last):
File "/data3/czx/genima/controller/train_act.py", line 292, in main
workspace.train()
File "/data3/czx/genima/controller/train_act.py", line 259, in train
self._load_demos()
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/workspace.py", line 489, in _load_demos
self.env_factory.load_demos_into_replay(self.cfg, self.replay_buffer)
File "/data3/czx/genima/controller/env/rlbench.py", line 351, in load_demos_into_replay
add_demo_to_replay_buffer(demo_env, buffer)
File "/data3/czx/genima/controller/env/rlbench_utils.py", line 262, in add_demo_to_replay_buffer
replay_buffer.add_final(**final_obs)
TypeError: UniformReplayBuffer.add_final() got an unexpected keyword argument 'front_rgb'
I replaced replay_buffer.add_final(**final_obs)
with replay_buffer.add_final(final_obs)
to fix this error.
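The difference between the two add_final calls comes down to keyword unpacking. Below is a minimal sketch with a hypothetical stub class; the single-dict signature is an assumption inferred from the fix above, not taken from the robobase source.

```python
class StubReplayBuffer:
    """Hypothetical stand-in that takes the final observation as one dict."""

    def __init__(self):
        self.final_observations = []

    def add_final(self, final_observation):
        # Keep a shallow copy of the final observation dict.
        self.final_observations.append(dict(final_observation))


buffer = StubReplayBuffer()
final_obs = {"front_rgb": "image placeholder"}

# `**final_obs` expands to add_final(front_rgb=...), which the signature rejects.
try:
    buffer.add_final(**final_obs)
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Passing the dict itself matches the single positional parameter.
buffer.add_final(final_obs)
print(buffer.final_observations)
```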
Unexpected error in training: operands could not be broadcast together with shapes (2,) (3,)
[2024-07-17 16:51:39,481][root][ERROR] - Traceback (most recent call last):
File "/data3/czx/genima/controller/train_act.py", line 218, in _train
self.agent.update(
File "/data3/czx/genima/controller/method/genima_act.py", line 365, in update
batch = next(replay_iter)
File "/data3/czx/genima/controller/utils/dataloader.py", line 93, in next
return self.sample(batch_size=len(batch_indices), indices=batch_indices)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 843, in sample
samples = [self.sample_single(indices[i]) for i in range(batch_size)]
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 843, in <listcomp>
samples = [self.sample_single(indices[i]) for i in range(batch_size)]
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 829, in sample_single
return self._sample_non_sequential(global_index)
File "/home/czx/anaconda3/envs/genima_env/lib/python3.10/site-packages/robobase/replay_buffer/uniform_replay_buffer.py", line 794, in _sample_non_sequential
episode[REWARD][idx]
ValueError: operands could not be broadcast together with shapes (2,) (3,)
I changed the relevant lines in uniform_replay_buffer.py to:
next_idx = min(next_idx, len(episode[TERMINAL]))
discount_slice_len = next_idx - idx
### 2024.7.17
reward_slice = episode[REWARD][idx:next_idx]
discount_slice = self._cumulative_discount_vector[:discount_slice_len]
# Adjust the shape if necessary
if reward_slice.shape != discount_slice.shape:
    min_len = min(reward_slice.shape[0], discount_slice.shape[0])
    reward_slice = reward_slice[:min_len]
    discount_slice = discount_slice[:min_len]
###
replay_sample.update(
    {
        REWARD: np.sum(reward_slice * discount_slice),
        TERMINAL: episode[TERMINAL][next_idx - 1],
        TRUNCATED: episode[TRUNCATED][next_idx - 1],
        INDICES: global_index,
        DISCOUNT: self._gamma**discount_slice_len,  # effective discount
    }
)
Notably, the above solutions may not always be right, but I'm now able to train successfully.
By the way, if you find any problems with my solutions above after checking them out (e.g. negatively affecting the training results), please let me know, thank you very much!
@albzni Thanks for the detailed information. I think we will push the first two as fixes. For the third one, if you remove the lines you added (i.e., keep the replay buffer as is) and add the following to controller.yaml:
replay:
  nstep: 1
Does that solve the issue?
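The sketch below shows one way a (2,) versus (3,) mismatch can arise. It assumes a 3-step reward window that is clipped by the end of an episode while the discount vector keeps its full length. Whether this is the exact cause inside robobase is not confirmed in the thread, and the reward values and the helper nstep_return are hypothetical.

```python
import numpy as np

gamma = 0.99
nstep = 3
# Discount weights for a full n-step window: [1, gamma, gamma**2].
discounts = gamma ** np.arange(nstep)

# Hypothetical per-step rewards of one short episode (5 transitions).
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])


def nstep_return(idx, n):
    """Discounted sum of up to n rewards starting at idx, clipped at episode end."""
    window = rewards[idx:idx + n]  # may hold fewer than n rewards near the end
    return float(np.sum(window * discounts[:len(window)]))


# A window starting two steps before the end only holds two rewards,
# so multiplying it by the full three-element discount vector fails.
short_window = rewards[3:3 + nstep]
try:
    short_window * discounts
except ValueError as err:
    message = str(err)
    print(message)

print(nstep_return(3, nstep))  # window clipped to the two remaining rewards
print(nstep_return(3, 1))      # a 1-step window never extends past the index itself
```

With nstep set to 1, the window for any valid index holds exactly one reward, so this kind of clipping cannot occur.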
It works! Thank you so much!
Great! Thanks for debugging with us!
I'm trying to train from scratch, but when I finally get to step "4. Train an ACT controller to follow spheres", an error is reported:
I'm guessing this error is probably due to the definition of env.tasks. But no matter how I define env.tasks (env.tasks=[take_lid_off_saucepan], env.tasks=['take_lid_off_saucepan'], or env.tasks= "[take_lid_off_saucepan]"), I get this error.
Do you have any suggestions on how to fix this? Thanks!