Description
I noticed the metaworld environments output rewards normalized by an RMS (running mean/std) tracker — see the environment initialization and the corresponding EnvNormalizationWrapper.
The normalized rewards are stored in the replay buffer (rather than the raw rewards), and when transitions are sampled for policy updates, their reward values are not re-normalized with the current RMS statistics. As a result, older transitions carry rewards scaled by whatever the statistics happened to be at collection time.
Is this an oversight in the code, or intentional? Presumably this could hurt performance, since the policy is trained on rewards with inconsistent scales?
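For concreteness, here is a minimal sketch of the alternative I'd expect: store raw rewards and normalize at sample time with the up-to-date statistics. The `RunningMeanStd` and `ReplayBuffer` classes below are hypothetical simplifications, not the repo's actual implementation, and only track rewards.

```python
import numpy as np

class RunningMeanStd:
    # Hypothetical minimal running mean/std tracker (parallel-variance batch
    # update), assumed similar in spirit to the RMS used by the wrapper.
    def __init__(self, epsilon: float = 1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, x: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / tot)
        self.mean += delta * batch_count / tot
        self.var, self.count = m2 / tot, tot

class ReplayBuffer:
    # Stores *raw* rewards; normalization happens at sample time with the
    # current RMS statistics, so old transitions are never scaled by stale
    # statistics frozen in at collection time.
    def __init__(self):
        self.rewards: list[float] = []

    def add(self, raw_reward: float, rms: RunningMeanStd) -> None:
        rms.update(np.array([raw_reward]))
        self.rewards.append(raw_reward)  # store raw, not normalized

    def sample(self, rms: RunningMeanStd) -> np.ndarray:
        raw = np.asarray(self.rewards)
        return raw / np.sqrt(rms.var + 1e-8)  # normalize with current stats
```

With this scheme, every sampled batch is scaled consistently by the same (current) statistics, at the cost of storing the extra raw values.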
How to reproduce