araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

Use of normalization seems to be weird with hyperparameter tuning #87

Closed caburu closed 4 years ago

caburu commented 4 years ago

Describe the bug

When we use reward normalization, it is expected that evaluations are done with the original reward values. This is indeed what happens for training (train.py, lines 291-298), but the evaluations during hyperparameter tuning do not seem to have the same behavior.

During hyperparameter tuning, if I use the option normalize: true in the configuration file, rewards are normalized neither for the agents nor for the evaluation. And if I use the option normalize: "{'norm_obs':True, 'norm_reward':True}", rewards are normalized for both the agent and the evaluation.

Analysing the code, the problem seems to be that after lines 291-298 are executed, all the envs created with create_env (line 228) use the same normalization parameters. When using normalize: true, the variable normalize_kwargs is equal to {'norm_reward': False}, and when using normalize: "{'norm_obs':True, 'norm_reward':True}", normalize_kwargs is equal to {'norm_obs':True, 'norm_reward': False}.

Am I understanding right? Or am I missing something?
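
To make the suspected mechanism concrete, here is a minimal sketch of the by-reference problem (hypothetical names, not the actual train.py code): because every env creation reads the same normalize_kwargs dict, changing it for the evaluation env also changes what all later envs get.

    # Hypothetical stand-in for the shared kwargs read by every env creation
    normalize_kwargs = {'norm_obs': True, 'norm_reward': True}

    def build_normalize_kwargs(shared_kwargs, eval_env):
        # Mutating the shared dict (instead of a copy) is the suspected bug
        if eval_env:
            shared_kwargs['norm_reward'] = False
        return shared_kwargs

    print(build_normalize_kwargs(normalize_kwargs, eval_env=True))
    # {'norm_obs': True, 'norm_reward': False}
    print(build_normalize_kwargs(normalize_kwargs, eval_env=False))
    # still {'norm_obs': True, 'norm_reward': False}: the training envs are affected too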

Code example

I'm using the following hyperparameter tuning example. The hyperparameter configuration contains normalize: true.

python train.py --algo ppo2 --env MountainCar-v0 -n 100 -optimize --n-trials 1 --n-jobs 1 --sampler tpe --pruner median

And this is the output:

========== MountainCar-v0 ==========
Seed: 0
OrderedDict([('ent_coef', 0.0),
             ('gamma', 0.99),
             ('lam', 0.98),
             ('n_envs', 16),
             ('n_steps', 16),
             ('n_timesteps', 1000000.0),
             ('nminibatches', 1),
             ('noptepochs', 4),
             ('normalize', True),
             ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=100
Normalizing input and reward
Creating test environment
Normalization activated: {'norm_reward': False}
Optimizing hyperparameters
Sampler: tpe - Pruner: median
Normalization activated: {'norm_reward': False}
Normalization activated: {'norm_reward': False}
[I 2020-06-20 15:51:06,742] Finished trial#0 with value: inf with parameters: {'batch_size': 32, 'n_steps': 2048, 'gamma': 0.999, 'lr': 0.1664672261412471, 'ent_coef': 0.010130230408794483, 'cliprange': 0.4, 'noptepochs': 5, 'lambda': 0.95}. Best is trial#0 with value: inf.
Number of finished trials:  1
Best trial:
Value:  inf
Params: 
    batch_size: 32
    n_steps: 2048
    gamma: 0.999
    lr: 0.1664672261412471
    ent_coef: 0.010130230408794483
    cliprange: 0.4
    noptepochs: 5
    lambda: 0.95
Writing report to logs/ppo2/report_MountainCar-v0_1-trials-100-tpe-median_1592679066.csv

We can see that the envs created before the hyperparameter tuning are set up as expected:

Normalizing input and reward
Creating test environment
Normalization activated: {'norm_reward': False}

But the envs created for the hyperparameter tuning do not use reward normalization:

Optimizing hyperparameters
Sampler: tpe - Pruner: median
Normalization activated: {'norm_reward': False}
Normalization activated: {'norm_reward': False}

System Info

I believe the relevant information is that I'm using the latest code version (commit fd9d38862047d7fd4f67be8eb3f6736e093eac9f).

Solution proposal

If you agree this is a problem, I can open a PR to provide a candidate solution. My idea is to move the code from lines 291-298 into the create_env function, relying on the eval_env parameter, as shown below:

            if normalize:
                # Work on a copy so the shared normalize_kwargs dict is never mutated
                local_normalize_kwargs = normalize_kwargs.copy()
                # Evaluation envs must report the original (non-normalized) reward
                if eval_env:
                    if len(local_normalize_kwargs) > 0:
                        local_normalize_kwargs['norm_reward'] = False
                    else:
                        local_normalize_kwargs = {'norm_reward': False}

                if args.verbose > 0:
                    print("Normalization activated: {}".format(local_normalize_kwargs))

                env = VecNormalize(env, **local_normalize_kwargs)
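
With this change each call to create_env works on its own copy of normalize_kwargs, so creating the evaluation env no longer affects the training envs. Assuming the existing create_env(n_envs, eval_env=False) signature (the exact signature may differ), the two kinds of env would then be built like this:

    env = create_env(n_envs)                 # training env: normalize_kwargs used as-is
    eval_env = create_env(1, eval_env=True)  # evaluation env: norm_reward forced to False locally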

Additional information

The problem also seems to be present in rl-baselines3-zoo. I can open a PR there as well.

araffin commented 4 years ago

Hello,

nice catch... but I'm only half surprised, as I was modifying things by reference... Please submit a PR with the fix you suggest ;) In fact, there is an additional fix to do: the check should be if args.eval_freq > 0 and not args.optimize_hyperparameters (the bug was introduced when I added the evaluation environment)...
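
Roughly, as a sketch of that extra guard (not the exact train.py code, the body here is only indicative):

    # Only build the separate evaluation env for a regular training run;
    # during hyperparameter optimization the objective creates its own envs.
    if args.eval_freq > 0 and not args.optimize_hyperparameters:
        if args.verbose > 0:
            print("Creating test environment")
        eval_env = create_env(1, eval_env=True)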

(once it is merged, you can do the same PR on the rl zoo3)

araffin commented 4 years ago

I fixed it there: https://github.com/DLR-RM/rl-baselines3-zoo/pull/22. I will do the PR for the zoo too.