Closed: sdpkjc closed this issue 9 months ago.
To me, this was the expected behaviour, in the same way that for `NormalizeObservation`, the observations are normalized over all episodes seen rather than just the current episode. @sdpkjc Do you disagree?
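For context, a minimal sketch of the running-statistics idea being described here; the class and method names are illustrative, not necessarily the wrapper's exact internals:

```python
import numpy as np

class RunningMeanStd:
    """Running mean/variance over every observation seen, across all
    episodes; the statistics are never reset between episodes."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4

    def update(self, batch: np.ndarray) -> None:
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Parallel-variance merge (Chan et al.) of old and new samples.
        self.mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta**2 * self.count * batch_count / total)
        self.var = m2 / total
        self.count = total

    def normalize(self, obs: np.ndarray) -> np.ndarray:
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```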
Yes, we should normalize the return over all episodes. The reward of each step is accumulated in `self.returns`. I think the `self.returns` variable should be cleared once a new episode begins; I've seen the openai/baselines code do the same. This problem exists only in `NormalizeReward`; `NormalizeObservation` is correct.
That openai/baselines example is a vector wrapper for `NormalizeReward`. Importantly, since vector environments autoreset, I don't think it actually clears the returns for each episode.
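For reference, a minimal sketch of the kind of per-environment clearing being discussed; the class and method names are mine, not the actual openai/baselines `VecNormalize` source:

```python
import numpy as np

class VecReturnAccumulator:
    """Illustrative vector-style accumulator: one discounted return per
    sub-environment, zeroed when that environment's episode ends."""

    def __init__(self, num_envs: int, gamma: float = 0.99):
        self.gamma = gamma
        self.returns = np.zeros(num_envs)

    def step(self, rewards: np.ndarray, dones: np.ndarray) -> np.ndarray:
        # Accumulate the discounted return for every sub-environment.
        self.returns = self.returns * self.gamma + rewards
        current = self.returns.copy()
        # Zero the accumulator for sub-environments whose episode just
        # ended; with autoreset, reset() is never called per episode, so
        # the clearing has to happen here on the `dones` mask instead.
        self.returns[dones.astype(bool)] = 0.0
        return current
```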
I see that this was indeed my misunderstanding. Thank you @pseudo-rnd-thoughts, closing this issue.
Describe the bug
I found that `NormalizeReward` doesn't implement the `reset` function and doesn't clear `self.returns` to zero when the environment is reset. Especially in truncated environments, this can cause a large deviation.
Code example
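The original snippet is not preserved in this thread; the following is a minimal sketch of the kind of reproduction described, assuming Gymnasium's `NormalizeReward` wrapper and the `self.returns` attribute mentioned above (the attribute name can differ between versions):

```python
import gymnasium as gym
from gymnasium.wrappers import NormalizeReward

# Short, frequently truncated episodes make the carried-over return visible.
env = gym.make("CartPole-v1", max_episode_steps=10)
env = NormalizeReward(env, gamma=0.99)

obs, info = env.reset(seed=0)
for _ in range(50):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        # Inspect the wrapper's internal discounted-return accumulator:
        # it is not zeroed on reset (and, per the discussion above, not on
        # truncation), so it can carry over into the next episode.
        print(env.returns)
        obs, info = env.reset()
```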
Output:
System info
No response
Additional context
No response
Checklist