Hi, @chernyadev! I follow [https://github.com/robobase-org/robobase.git]'s implementation and run the Diffusion Policy / ACT training code on BiGym [https://github.com/chernyadev/bigym/tree/fix_original_demos], but despite extensive debugging I still can't make things work. Here's what I've tried:

I chose the dishwasher_close and drawer_top_close tasks for training; in the original BiGym paper both tasks reach nearly 100% success rate with Diffusion Policy and ACT. The robobase repo doesn't explain how robobase_config.yaml should be changed to fit the imitation learning pipeline, so I modified it according to the BiGym paper. Here's my robobase_config.yaml:
```yaml
defaults:
  - _self_
  - env: null
  - method: null
  - intrinsic_reward_module: null
  - launch: null
  - override hydra/launcher: joblib

# Universal settings
create_train_env: true
num_train_envs: 1
replay_size_before_train: 3000
num_pretrain_steps: 1100000
num_train_frames: 1100000
eval_every_steps: 10000
num_eval_episodes: 5
update_every_steps: 2
num_explore_steps: 2000
save_snapshot: false
snapshot_every_n: 10000
batch_size: 256

# Demonstration settings
demos: 200
demo_batch_size: 256 # null # If set to > 0, introduce a separate buffer for demos
use_self_imitation: true # false # When using a separate buffer for demos, if set to true, save successful (online) trajectories into the separate demo buffer

# Observation settings
pixels: false
visual_observation_shape: [84, 84]
frame_stack: 2
frame_stack_on_channel: true
use_onehot_time_and_no_bootstrap: false

# Action settings
action_repeat: 1
action_sequence: 16 # ActionSequenceWrapper
execution_length: 8 # If execution_length < action_sequence, we use receding horizon control
temporal_ensemble: true # Temporal ensembling only applicable to action sequence > 1
temporal_ensemble_gain: 0.01
use_standardization: true # Demo-based standardization for action space
use_min_max_normalization: true # Demo-based min-max normalization for action space
min_max_margin: 0.0 # If set to > 0, introduce margin for demo-driven min-max normalization
norm_obs: true

# Replay buffer settings
replay:
  prioritization: false
  size: 1000000
  gamma: 0.99
  demo_size: 1000000
  save_dir: null
  nstep: 3
  num_workers: 4
  pin_memory: true
  alpha: 0.7 # prioritization
  beta: 0.5 # prioritization
  sequential: false
  transition_seq_len: 1 # The length of transition sequence returned from sample() call. Only applicable if sequential is true

# Logging settings
wandb: # Weights & Biases
  use: true
  project: ${oc.env:USER}RoboBase
  name: null
tb: # TensorBoard
  use: false
  log_dir: /tmp/robobase_tb_logs
  name: null

# Misc
experiment_name: exp
seed: 1
num_gpus: 1
log_every: 1000
log_train_video: false
log_eval_video: true
log_pretrain_every: 100
save_csv: false

hydra:
  run:
    dir: ./exp_local/${now:%Y.%m.%d}/${now:%H%M%S}_${hydra.job.override_dirname}
  sweep:
    dir: ./exp_local/${now:%Y.%m.%d}/${now:%H%M}_${hydra.job.override_dirname}
    subdir: ${hydra.job.num}
```
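To spell out how I read the action settings above (a standalone sketch of my understanding, not robobase code): each policy query predicts action_sequence actions, the wrapper executes execution_length of them before re-planning (receding horizon control), and with temporal_ensemble: true overlapping predictions for the same step are supposed to be blended instead.

```python
# My reading of the receding-horizon settings (standalone sketch, not robobase code):
# one policy query predicts `action_sequence` actions, but only the first
# `execution_length` of them are executed before the policy is queried again.
action_sequence = 16   # from my config above
execution_length = 8   # from my config above


def receding_horizon_schedule(total_env_steps: int):
    """Yield (policy_query_index, env_step) pairs under receding-horizon control."""
    env_step, query = 0, 0
    while env_step < total_env_steps:
        for _ in range(min(execution_length, total_env_steps - env_step)):
            yield query, env_step
            env_step += 1
        query += 1


for query, step in receding_horizon_schedule(24):
    print(f"env step {step:2d} executed from policy query {query}")
# steps 0-7 come from query 0, steps 8-15 from query 1, steps 16-23 from query 2
```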
Then I use the following command to start Diffusion Policy training:
But nothing works: the success rate is always zero. Besides, it seems there's a bug in your temporal ensemble code. In robobase/robobase/envs/wrappers/action_sequence.py, in step_sequence(), the temporal ensemble code is:
```python
self._action_history[
    self._cur_step, self._cur_step : self._cur_step + self._sequence_length
] = action
for i, sub_action in enumerate(action):
    if self._temporal_ensemble and self._sequence_length > 1:
        # Select all predicted actions for self._cur_step. This will cover the
        # actions from [cur_step - sequence_length + 1, cur_step)
        # Note that not all actions in this range will be valid as we might have
        # execution_length > 1, which skips some of the intermediate steps.
        cur_actions = self._action_history[:, self._cur_step]
        indices = np.all(cur_actions != 0, axis=1)
        cur_actions = cur_actions[indices]
        # earlier predicted actions will have smaller weights.
        exp_weights = np.exp(-self._gain * np.arange(len(cur_actions)))
        exp_weights = (exp_weights / exp_weights.sum())[:, None]
        sub_action = (cur_actions * exp_weights).sum(axis=0)
    observation, reward, termination, truncation, info = self.env.step(
        sub_action
    )
    self._cur_step += 1
    if self.is_demo_env:
        demo_actions[i] = info.pop("demo_action")
    total_reward += reward
    action_idx_reached += 1
    if termination or truncation:
        break
    if not self.is_demo_env:
        if action_idx_reached == self._execution_length:
            break
```
It seems that even with temporal ensembling enabled, the function still runs through the whole for loop and steps the environment with sub_action multiple times, and each time the sub_action is the same. So I changed the code as follows to avoid this:
```python
self._action_history[
    self._cur_step, self._cur_step : self._cur_step + self._sequence_length
] = action
for i, sub_action in enumerate(action):
    if self._temporal_ensemble and self._sequence_length > 1:
        # Select all predicted actions for self._cur_step. This will cover the
        # actions from [cur_step - sequence_length + 1, cur_step)
        # Note that not all actions in this range will be valid as we might have
        # execution_length > 1, which skips some of the intermediate steps.
        cur_actions = self._action_history[:, self._cur_step]
        indices = np.all(cur_actions != 0, axis=1)
        cur_actions = cur_actions[indices]
        # earlier predicted actions will have smaller weights.
        exp_weights = np.exp(-self._gain * np.arange(len(cur_actions)))
        exp_weights = (exp_weights / exp_weights.sum())[:, None]
        sub_action = (cur_actions * exp_weights).sum(axis=0)
    observation, reward, termination, truncation, info = self.env.step(
        sub_action
    )
    self._cur_step += 1
    if self.is_demo_env:
        demo_actions[i] = info.pop("demo_action")
    total_reward += reward
    action_idx_reached += 1
    if termination or truncation:
        break
    # --- my change: with temporal ensembling, execute only one env step per policy query ---
    if self._temporal_ensemble and self._sequence_length > 1:
        break
    if not self.is_demo_env:
        if action_idx_reached == self._execution_length:
            break
```
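As a sanity check on the ensembling itself, here is the weight computation in isolation (a standalone numpy sketch mirroring the lines above, not robobase code); with my temporal_ensemble_gain of 0.01 the weights over the overlapping predictions come out almost uniform:

```python
import numpy as np

# Standalone sketch of the exponential weighting used above. `preds` stands in
# for the overlapping predictions that different policy queries made for the
# same environment step (toy 3-dimensional actions).
gain = 0.01  # temporal_ensemble_gain from my config
preds = np.stack([np.full(3, v) for v in (0.10, 0.12, 0.14, 0.16)])

weights = np.exp(-gain * np.arange(len(preds)))   # same form as in step_sequence()
weights = (weights / weights.sum())[:, None]

ensembled = (preds * weights).sum(axis=0)
print(weights.ravel())  # ~[0.254, 0.251, 0.249, 0.246] -> nearly uniform
print(ensembled)        # a near-plain average of the overlapping predictions
```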
Then I launched another training run, but still nothing works.
Then I took a look at the eval videos:
Drawer_top_close:
It seems that the robot completes the task to a certain degree, but the overall success rate is still zero and the task reward is zero as well. This is weird.
Dishwasher_close:
The robot fails to complete the task at all.
Besides, I found that you use DDIM for training. As far as I know, DDIM is only meant to accelerate sampling from a DDPM, while training itself still relies on the standard DDPM objective. I'm not a diffusion expert, so I may well be wrong about this.
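To make my question concrete, here is my (possibly mistaken) mental model as a sketch with the HuggingFace diffusers schedulers: the network is trained with the usual DDPM noise-prediction objective, and DDIM only changes how that same trained network is sampled at inference time. The network, dimensions, and data below are made-up placeholders, not robobase's model.

```python
import torch
from diffusers import DDPMScheduler, DDIMScheduler

act_dim, horizon = 16, 16                      # placeholder action dim / chunk length
net = torch.nn.Linear(act_dim * horizon, act_dim * horizon)  # stand-in noise-prediction net

# Training: standard DDPM epsilon-prediction objective.
train_sched = DDPMScheduler(num_train_timesteps=100)
actions = torch.randn(8, act_dim * horizon)    # a toy batch of normalized action chunks
noise = torch.randn_like(actions)
t = torch.randint(0, train_sched.config.num_train_timesteps, (actions.shape[0],))
noisy = train_sched.add_noise(actions, noise, t)
loss = torch.nn.functional.mse_loss(net(noisy), noise)  # predict the injected noise
loss.backward()

# Inference: DDIM is just a different (faster) sampler over the same trained net.
sample_sched = DDIMScheduler(num_train_timesteps=100)
sample_sched.set_timesteps(10)                 # e.g. 10 DDIM steps instead of 100 DDPM steps
sample = torch.randn(1, act_dim * horizon)
for step_t in sample_sched.timesteps:
    eps = net(sample)                          # real models also condition on step_t and obs
    sample = sample_sched.step(eps, step_t, sample).prev_sample
```

So my question is whether using the DDIM scheduler in the training path changes the objective, or whether I'm simply misreading the code.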
So I tried launching training with the ACT code, and it turned out:
```
Error executing job with overrides: ['method=act', 'env=bigym/dishwasher_close', 'replay.nstep=1']
Error in call to target 'robobase.method.act.ActBCAgent':
AttributeError("'NoneType' object has no attribute 'output_shape'")
full_key: method
```
I figured out that this is because the ACT code only supports image observations, whereas in the experiments above I used state observations. So I changed the pixels setting in robobase_config.yaml (from pixels: false to pixels: true), and ended up hitting the problem already reported in [Inquiry Regarding Using Repo with BiGym · Issue #3 · robobase-org/robobase](https://github.com/robobase-org/robobase/issues/3), which remains unanswered.
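For what it's worth, my guess at the concrete failure mode (a generic sketch of the pattern I suspect, not the actual ActBCAgent code): with pixels: false no visual encoder gets built, so a later access to encoder.output_shape hits None.

```python
# Generic sketch of the failure pattern I suspect; all names here are made up.
class DummyEncoder:
    output_shape = (512,)


class FakeAgent:
    def __init__(self, pixels: bool):
        # With state-only observations no visual encoder is built.
        self.encoder = DummyEncoder() if pixels else None

    def head_input_shape(self):
        # Raises AttributeError: 'NoneType' object has no attribute 'output_shape'
        # when pixels is False, matching the trace above.
        return self.encoder.output_shape


FakeAgent(pixels=True).head_input_shape()    # fine
FakeAgent(pixels=False).head_input_shape()   # AttributeError, as in my run
```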
Then I wanted to check whether there's anything wrong with the demos, but I can't run your demo replay code since my server has no monitor or GUI.
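If the demo replayer can render off-screen, a pointer would be great; the only rendering path available on my machine is MuJoCo's headless backends (this is an assumption on my side that bigym renders through MuJoCo), e.g.:

```python
# Headless rendering setup I can use (assumption: the replayer goes through MuJoCo).
# MUJOCO_GL must be set before mujoco is imported; "osmesa" is a CPU-only fallback.
import os
os.environ["MUJOCO_GL"] = "egl"
```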
So I'm stuck here and would really appreciate your help. To be honest, I think BiGym is an amazing benchmark; it's the only one I know of that supports long-horizon mobile manipulation and comes with plenty of human demos and tasks. I believe better maintenance of this repository would greatly enlarge the impact of this work in the community, and I hope my feedback helps you develop it further.