facebookresearch / mtrl

Multi Task RL Baselines
MIT License

Reward Normalization Issue #30

Open RobertMcCarthy97 opened 1 year ago

RobertMcCarthy97 commented 1 year ago

Description

I noticed that the Metaworld environments output rewards normalized by a running mean/std (RMS) estimate (see the environment initialization and the corresponding EnvNormalizationWrapper).

The normalized rewards (rather than the raw rewards) are what get saved in the replay buffer, and when they are later sampled for policy updates they are not re-normalized to reflect the current RMS reward statistics.
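To make the concern concrete, here is a minimal sketch of the pattern I mean (this is not the actual mtrl / EnvNormalizationWrapper code; the class and variable names are made up): the reward is normalized with whatever statistics exist at collection time, and that already-normalized value is what ends up in the buffer, so it goes stale as the statistics drift.

```python
import numpy as np

class RunningRewardStats:
    """Toy running mean/std tracker (stand-in for the wrapper's RMS stats)."""
    def __init__(self, eps=1e-8):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, rewards):
        batch_mean, batch_var, n = np.mean(rewards), np.var(rewards), len(rewards)
        delta = batch_mean - self.mean
        total = self.count + n
        m2 = self.var * self.count + batch_var * n + delta**2 * self.count * n / total
        self.mean, self.var, self.count = self.mean + delta * n / total, m2 / total, total

    def normalize(self, r):
        return (r - self.mean) / np.sqrt(self.var + 1e-8)

stats, buffer = RunningRewardStats(), []

# Collection time: the reward is normalized with *today's* statistics,
# and the normalized value is what is stored in the replay buffer.
raw_reward = 10.0
stats.update([raw_reward])
buffer.append(stats.normalize(raw_reward))

# Later the running statistics drift (e.g. rewards grow as the policy improves)...
stats.update([100.0] * 1000)

# ...but the stored reward is never re-normalized, so it sits on a stale scale
# compared to a reward normalized with the current statistics.
stale = buffer[0]
fresh = stats.normalize(raw_reward)
print(stale, fresh)  # very different values for the same raw reward
```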

Is this an oversight in the code? Presumably it could hurt performance, since old transitions end up with rewards on a different scale than recent ones?
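For reference, the remedy I would have expected (just a sketch, not something I see in mtrl; sample_rewards and RunningRewardStats are hypothetical names from the example above) is to keep raw rewards in the buffer and normalize only at sample time with the current statistics:

```python
import numpy as np

def sample_rewards(raw_rewards, idxs, stats):
    # 'stats' is the current running mean/std tracker (e.g. the toy
    # RunningRewardStats above); normalizing only at sample time means
    # every batch is on the up-to-date scale.
    return stats.normalize(np.asarray(raw_rewards)[idxs])
```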

How to reproduce

PYTHONPATH=. python3 -u main.py \
setup=metaworld \
env=metaworld-mt10 \
agent=state_sac \
experiment.num_eval_episodes=1 \
experiment.num_train_steps=2000000 \
setup.seed=1 \
replay_buffer.batch_size=1280 \
agent.multitask.num_envs=10 \
agent.multitask.should_use_disentangled_alpha=True \
agent.multitask.should_use_task_encoder=True \
agent.encoder.type_to_select=moe \
agent.multitask.should_use_multi_head_policy=False \
agent.encoder.moe.task_id_to_encoder_id_cfg.mode=attention \
agent.encoder.moe.num_experts=4 \
agent.multitask.task_encoder_cfg.model_cfg.pretrained_embedding_cfg.should_use=True