Description
I noticed the metaworld environments output rewards normalized by an RMS (running mean/std) tracker — see the environment initialization and the corresponding EnvNormalizationWrapper.
The normalized rewards are stored in the replay buffer (rather than the raw rewards), and when transitions are sampled for policy updates, their reward values are not re-normalized with the current RMS statistics. As a result, older transitions carry rewards scaled by whatever the statistics happened to be at collection time.
Is this an oversight in the code, or intentional? Presumably this could hurt performance, since the policy is trained on rewards with inconsistent scales?
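For concreteness, here is a minimal sketch of the alternative I'd expect: store raw rewards and normalize at sample time with the up-to-date statistics. The `RunningMeanStd` and `ReplayBuffer` classes below are hypothetical simplifications, not the repo's actual implementation, and only track rewards.

```python
import numpy as np

class RunningMeanStd:
    # Hypothetical minimal running mean/std tracker (parallel-variance batch
    # update), assumed similar in spirit to the RMS used by the wrapper.
    def __init__(self, epsilon: float = 1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, x: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / tot)
        self.mean += delta * batch_count / tot
        self.var, self.count = m2 / tot, tot

class ReplayBuffer:
    # Stores *raw* rewards; normalization happens at sample time with the
    # current RMS statistics, so old transitions are never scaled by stale
    # statistics frozen in at collection time.
    def __init__(self):
        self.rewards: list[float] = []

    def add(self, raw_reward: float, rms: RunningMeanStd) -> None:
        rms.update(np.array([raw_reward]))
        self.rewards.append(raw_reward)  # store raw, not normalized

    def sample(self, rms: RunningMeanStd) -> np.ndarray:
        raw = np.asarray(self.rewards)
        return raw / np.sqrt(rms.var + 1e-8)  # normalize with current stats
```

With this scheme, every sampled batch is scaled consistently by the same (current) statistics, at the cost of storing the extra raw values.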
How to reproduce