Closed: sdpkjc closed this issue 9 months ago.
To me, this was the expected behaviour, in the same way that for `NormalizeObservation`, the observations are normalized over all episodes seen rather than just the current episode. @sdpkjc Do you disagree?
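For context, a minimal sketch of the running-statistics idea being described here; the class and method names are illustrative, not necessarily the wrapper's exact internals:

```python
import numpy as np

class RunningMeanStd:
    """Running mean/variance over every observation seen, across all
    episodes; the statistics are never reset between episodes."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4

    def update(self, batch: np.ndarray) -> None:
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Parallel-variance merge (Chan et al.) of old and new samples.
        self.mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta**2 * self.count * batch_count / total)
        self.var = m2 / total
        self.count = total

    def normalize(self, obs: np.ndarray) -> np.ndarray:
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```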
Yes, we should normalize the return over all episodes. The reward of each step is accumulated in `self.returns`. I think the `self.returns` variable should be cleared once a new episode begins; I've seen the openai/baselines code do the same. This problem exists only in `NormalizeReward`; `NormalizeObservation` is correct.
That openai/baselines example is a vector wrapper for `NormalizeReward`. Importantly, since vector environments autoreset, I don't think it actually clears the returns for each episode.
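For reference, a minimal sketch of the kind of per-environment clearing being discussed; the class and method names are mine, not the actual openai/baselines `VecNormalize` source:

```python
import numpy as np

class VecReturnAccumulator:
    """Illustrative vector-style accumulator: one discounted return per
    sub-environment, zeroed when that environment's episode ends."""

    def __init__(self, num_envs: int, gamma: float = 0.99):
        self.gamma = gamma
        self.returns = np.zeros(num_envs)

    def step(self, rewards: np.ndarray, dones: np.ndarray) -> np.ndarray:
        # Accumulate the discounted return for every sub-environment.
        self.returns = self.returns * self.gamma + rewards
        current = self.returns.copy()
        # Zero the accumulator for sub-environments whose episode just
        # ended; with autoreset, reset() is never called per episode, so
        # the clearing has to happen here on the `dones` mask instead.
        self.returns[dones.astype(bool)] = 0.0
        return current
```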
I see that this was indeed my misunderstanding. Thank you @pseudo-rnd-thoughts, closing this issue.
Describe the bug
I found that `NormalizeReward` doesn't implement the `reset` function and doesn't clear `self.returns` to zero when the environment is reset. Especially in truncated environments, this can cause a large deviation.
Code example
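The original snippet is not preserved in this thread; the following is a minimal sketch of the kind of reproduction described, assuming Gymnasium's `NormalizeReward` wrapper and the `self.returns` attribute mentioned above (the attribute name can differ between versions):

```python
import gymnasium as gym
from gymnasium.wrappers import NormalizeReward

# Short, frequently truncated episodes make the carried-over return visible.
env = gym.make("CartPole-v1", max_episode_steps=10)
env = NormalizeReward(env, gamma=0.99)

obs, info = env.reset(seed=0)
for _ in range(50):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        # Inspect the wrapper's internal discounted-return accumulator:
        # it is not zeroed on reset (and, per the discussion above, not on
        # truncation), so it can carry over into the next episode.
        print(env.returns)
        obs, info = env.reset()
```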
Output:
System info
No response
Additional context
No response
Checklist