Closed caburu closed 4 years ago
Hello,
nice catch... but I'm only half surprised as I was modifying things by reference...
Please submit a PR with the fix you suggest ;)
In fact, there is an additional fix to do:
`if args.eval_freq > 0 and not args.optimize_hyperparameters`
the bug was introduced when I added the evaluation environment...
(once it is merged, you can do the same PR on the RL Zoo3)
I fixed it there: https://github.com/DLR-RM/rl-baselines3-zoo/pull/22 I will do the PR for the zoo too.
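The additional fix mentioned above amounts to only creating the standalone evaluation environment when both flags allow it. A minimal sketch of that condition (the argument names follow train.py, but the helper function itself is hypothetical):

```python
def should_create_eval_env(eval_freq: int, optimize_hyperparameters: bool) -> bool:
    # During hyperparameter optimization, the tuning loop handles its own
    # evaluation, so the standalone eval env is only needed for a plain
    # training run with periodic evaluation enabled.
    return eval_freq > 0 and not optimize_hyperparameters

print(should_create_eval_env(10000, False))  # True: normal run with evaluation
print(should_create_eval_env(10000, True))   # False: tuning run
print(should_create_eval_env(0, False))      # False: evaluation disabled
```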
Describe the bug
When we use reward normalization, it is expected that evaluations are done with the original reward values. This is indeed the case for training (train.py, lines 291-298), but the evaluations during hyperparameter tuning do not seem to have the same behavior.
During the hyperparameter tuning, if I use the option `normalize: true` in the configuration file, rewards are not normalized for the agent nor for the evaluation. And if I use the option `normalize: "{'norm_obs':True, 'norm_reward':True}"`, rewards are normalized for both the agent and the evaluation.

Analysing the code, the problem seems to be that after lines 291-298 are executed, all the envs created with `create_env` (line 228) use the same normalization parameters. When using `normalize: true`, the variable `normalize_kwargs` is equal to `{'norm_reward': False}`, and when using `normalize: "{'norm_obs':True, 'norm_reward':True}"`, `normalize_kwargs` is equal to `{'norm_obs': True, 'norm_reward': False}`.

Am I understanding this right? Or am I missing something?
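To illustrate the by-reference issue, here is a minimal self-contained sketch (the names mirror train.py, but the bodies are simplified stand-ins, not the actual zoo code):

```python
# Simplified sketch of how a normalization kwargs dict shared by reference
# leaks a training-time modification into every env created afterwards.

def create_env(normalize_kwargs, eval_env=False):
    # Hypothetical stand-in for the zoo's create_env: it just records
    # which normalization settings this env would be wrapped with.
    return dict(normalize_kwargs)

# Shared dict, as when `normalize: true` is parsed from the YAML config.
normalize_kwargs = {}

# The eval-env setup disables reward normalization by mutating the
# shared dict in place (analogous to lines 291-298 in train.py)...
normalize_kwargs["norm_reward"] = False

# ...so every env created afterwards silently inherits that setting.
train_env = create_env(normalize_kwargs)
tuning_env = create_env(normalize_kwargs)

print(train_env)   # {'norm_reward': False}
print(tuning_env)  # {'norm_reward': False} -- tuning env also lost reward normalization
```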
Code example
I'm using the following hyperparameter tuning example. The hyperparameter configuration is
`normalize: true`
And this is the output:
We can see that the envs created before the hyperparameter tuning use the expected normalization:

But the envs created for the hyperparameter tuning do not use reward normalization:
System Info
I believe that the relevant information is that I'm using the latest code version (commit fd9d38862047d7fd4f67be8eb3f6736e093eac9f).
Solution proposal
If you agree this is a problem, I can open a PR to provide a candidate solution. My idea is to move the code from lines 291-298 into the `create_env` function, relying on the `eval_env` parameter, as shown below:

Additional information
The problem seems to be present also in rl-baselines3-zoo. I can open a PR there as well.
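The proposal of moving the normalization handling into `create_env` could look roughly like this hypothetical sketch (`create_env`, `eval_env`, and `normalize_kwargs` follow train.py; the body is illustrative, not the actual zoo code):

```python
# Hypothetical sketch: create_env decides itself whether to disable reward
# normalization, based on the eval_env flag, and never mutates the caller's dict.

def create_env(normalize_kwargs, eval_env=False):
    # Work on a copy so callers' dicts are never modified by reference.
    normalize_kwargs = dict(normalize_kwargs)
    if eval_env:
        # Evaluation should always report the original (unnormalized) rewards.
        normalize_kwargs["norm_reward"] = False
    # ... here the real function would build the VecEnv and wrap it with
    # VecNormalize(**normalize_kwargs) ...
    return normalize_kwargs  # returned only so this sketch is testable

train_kwargs = create_env({"norm_obs": True, "norm_reward": True})
eval_kwargs = create_env({"norm_obs": True, "norm_reward": True}, eval_env=True)
print(train_kwargs)  # {'norm_obs': True, 'norm_reward': True}
print(eval_kwargs)   # {'norm_obs': True, 'norm_reward': False}
```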