araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

[question] Transfer hyperparameters from optuna #117

Closed · IlonaAT closed this issue 2 years ago

IlonaAT commented 2 years ago

For learning purposes I am tuning a number of algorithms for the environment 'MountainCar-v0'. At the moment I am interested in PPO. I intend to share working tuned hyperparameters by putting them on your repo. I am trying to understand, with some depth, how a variety of algorithms work hands-on, and SB3 and the zoo are great tools for that. So I used Optuna via the zoo to find the right parameters for PPO, and judging by the results it produced, I would say the hyperparameters should work:

I execute as indicated:

```
train.py --algo ppo --env MountainCar-v0 -n 50000 -optimize --n-trials 1000 --n-jobs 2 --sampler tpe --pruner median
```

Output:

```
========== MountainCar-v0 ==========
Seed: 2520733740
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('ent_coef', 0.0), ('gae_lambda', 0.98), ('gamma', 0.99), ('n_envs', 16), ('n_epochs', 4), ('n_steps', 16), ('n_timesteps', 1000000.0), ('normalize', True), ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=50000
Normalization activated: {'gamma': 0.99}
Optimizing hyperparameters
Sampler: tpe - Pruner: median
```

Then one nice result is:

```
Trial 151 finished with value: -95.4 and parameters: {'batch_size': 256, 'n_steps': 32, 'gamma': 0.999, 'learning_rate': 0.00043216809397908225, 'ent_coef': 5.844122887301502e-07, 'clip_range': 0.2, 'n_epochs': 10, 'gae_lambda': 0.92, 'max_grad_norm': 2, 'vf_coef': 0.035882158772375855, 'net_arch': 'medium', 'activation_fn': 'relu'}. Best is trial 151 with value: -95.4.
Normalization activated: {'gamma': 0.99}
Normalization activated: {'gamma': 0.99, 'norm_reward': False}
```

When passing these hyperparameters to the algorithm, it does not work, and I do not exactly understand why.

```python
import torch as th

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

envm = make_vec_env("MountainCar-v0", n_envs=16)
policy_kwargs = dict(
    activation_fn=th.nn.ReLU,
    net_arch=[dict(pi=[254, 254], vf=[254, 254])],
)
model = PPO(
    "MlpPolicy",
    envm,
    verbose=1,
    batch_size=256,
    n_steps=2048,
    gamma=0.9999,
    learning_rate=0.00043216809397908225,
    ent_coef=5.844122887301502e-07,
    clip_range=0.2,
    n_epochs=10,
    gae_lambda=0.92,
    max_grad_norm=2,
    vf_coef=0.035882158772375855,
    policy_kwargs=policy_kwargs,
)
model.learn(total_timesteps=1000000)
model.save("ppo_mountaincar")
```

From my reading of the docs, I would say it is supposed to work like that. Am I wrong? Should I take something else into account?
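For context on the gap between the trial output and the hand-written script: the trial reports n_steps=32, gamma=0.999 and net_arch='medium', and the zoo defaults include normalize: True, whereas the script above uses n_steps=2048, gamma=0.9999, a [254, 254] network and no normalization. Below is a minimal sketch of passing the trial's reported values to PPO directly, assuming the zoo's 'medium' corresponds to [256, 256] and that normalize: True corresponds to a VecNormalize wrapper; the "vecnormalize.pkl" filename is just illustrative.

```python
# Sketch only: hand-transferring the values from trial 151 to SB3 PPO.
# The 'medium' -> [256, 256] mapping and the VecNormalize wrapper are
# assumptions based on the zoo's conventions, not stated in this issue.
import torch as th

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("MountainCar-v0", n_envs=16)
# The zoo normalizes observations (and rewards during training) when 'normalize' is True.
env = VecNormalize(env, gamma=0.999)

policy_kwargs = dict(
    activation_fn=th.nn.ReLU,
    net_arch=[dict(pi=[256, 256], vf=[256, 256])],  # assumed 'medium'
)

model = PPO(
    "MlpPolicy",
    env,
    batch_size=256,
    n_steps=32,  # value reported by the trial, not the 2048 used above
    gamma=0.999,
    learning_rate=0.00043216809397908225,
    ent_coef=5.844122887301502e-07,
    clip_range=0.2,
    n_epochs=10,
    gae_lambda=0.92,
    max_grad_norm=2,
    vf_coef=0.035882158772375855,
    policy_kwargs=policy_kwargs,
    verbose=1,
)
model.learn(total_timesteps=1000000)
model.save("ppo_mountaincar")
env.save("vecnormalize.pkl")  # keep normalization statistics for evaluation
```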

araffin commented 2 years ago

Hello, you are apparently posting in the wrong repo ... (the SB2 one, not SB3). Please use markdown code blocks to format the code too ;)

There are actually related issues in the other repo (and some PRs are going to be made to automate that process): https://github.com/DLR-RM/rl-baselines3-zoo/issues/121
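The process being automated is, roughly: load the finished Optuna study, take the best trial's sampled parameters, and convert the string choices back into objects before handing them to PPO. A rough sketch of that idea follows; the study name, storage URL and the trial_params_to_kwargs helper are hypothetical, and the 'small'/'medium' mappings are assumed from the zoo's sampler rather than stated in this thread.

```python
# Sketch of the idea behind automating the transfer: read the best trial back
# out of an Optuna study and turn its sampled values into PPO kwargs.
# Study name, storage URL and this helper are hypothetical examples.
import optuna
import torch as th


def trial_params_to_kwargs(params: dict) -> dict:
    """Re-apply the string-to-object mappings used when sampling (assumed here)."""
    params = dict(params)  # do not mutate the trial's dict
    net_arch = {"small": [64, 64], "medium": [256, 256]}[params.pop("net_arch")]
    activation_fn = {"tanh": th.nn.Tanh, "relu": th.nn.ReLU}[params.pop("activation_fn")]
    params["policy_kwargs"] = dict(
        net_arch=[dict(pi=net_arch, vf=net_arch)],
        activation_fn=activation_fn,
    )
    return params


study = optuna.load_study(study_name="ppo-MountainCar-v0", storage="sqlite:///optuna.db")
ppo_kwargs = trial_params_to_kwargs(study.best_trial.params)
print(ppo_kwargs)
```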

IlonaAT commented 2 years ago

I am sorry for that. I shall format in code blocks as you ask and put the question where it belongs. :) I will check the questions/issues raised in the right repo.