DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Question] Results vastly different for an agent created with Stable Baselines3 using hyperparameters optimized in RL Baselines3 Zoo. #458

Open mzelazko opened 6 months ago

mzelazko commented 6 months ago

❓ Question

Hello. I first optimized A2C for 1 million steps using RL Baselines3 Zoo.

First, I changed `a2c.yml` in RL Baselines3 Zoo to work with the RAM version of Seaquest:

```yaml
atari:
  policy: 'MlpPolicy'
  n_envs: 16
  policy_kwargs: "dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5))"
```

Then I ran this command:

```bash
python -m train --algo a2c --env ALE/Seaquest-ram-v5 -n 1000000 -optimize --n-trials 100 --n-startup-trials 10 \
  --sampler tpe --pruner median --n-evaluations 4 --n-eval-envs 16 --storage "some_valid_database" --study-name test
```
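As a side note, the sampled hyperparameters can also be read back from the Optuna storage programmatically instead of through MySQL Shell. A minimal sketch, assuming the same study name and storage URL as in the command above (`"some_valid_database"` is a placeholder, not a real URL):

```python
import optuna

# Load the finished study from the same storage used during optimization.
study = optuna.load_study(study_name="test", storage="some_valid_database")

# Best trial: its objective value and the hyperparameters Optuna sampled.
print(study.best_trial.value)
print(study.best_trial.params)
```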

Top 3 results: [screenshot of the Optuna results in MySQL Shell]. Then, using for example this trial's hyperparameters ([screenshot of the trial's parameters in MySQL Shell]) with this code:

```python
import torch

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike


def linear_decay_lr(progress_remaining):
    # Linear schedule: decay from the sampled initial learning rate to 0
    # as training progresses (progress_remaining goes from 1 to 0).
    return 0.00027232300584036946 * progress_remaining


if __name__ == "__main__":
    vec_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=16)
    model = A2C(
        "MlpPolicy",
        vec_env,
        learning_rate=linear_decay_lr,
        n_steps=256,
        gamma=0.999,
        gae_lambda=0.98,
        ent_coef=0.00001753537605091099,
        vf_coef=0.19195701505334234,
        max_grad_norm=0.5,
        use_rms_prop=True,
        normalize_advantage=False,
        verbose=1,
        tensorboard_log="./seaquest/107",
        policy_kwargs=dict(
            activation_fn=torch.nn.Tanh,
            net_arch=dict(pi=[256, 256], vf=[256, 256]),
            ortho_init=True,
            optimizer_class=RMSpropTFLike,
            optimizer_kwargs=dict(eps=1e-5),
        ),
    )
    model.learn(total_timesteps=1000000, log_interval=1)
```
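To compare against the zoo's reported score, it may be fairer to run a separate evaluation rather than reading the training curve, since training-curve rewards and evaluation rewards can differ. A minimal sketch using Stable Baselines3's `evaluate_policy` (the episode count of 10 is an arbitrary choice):

```python
from stable_baselines3.common.evaluation import evaluate_policy

# Evaluate the trained model on a fresh single env over several episodes.
eval_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=1)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.1f} +/- {std_reward:.1f}")
```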

I get these results: [screenshot of the TensorBoard reward curve]

As the picture shows, the result is a long way from the 456 that RL Baselines3 Zoo reached. I have tried other sets of hyperparameters, but the scores are always much lower. One factor I am aware of that could affect this is the seed, since I did not use the same one. Nevertheless, I have trained many instances of A2C and the problem remains.
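If the seed is the suspected factor, a minimal sketch of fixing it in one place (the value 42 is an arbitrary assumption; the remaining A2C kwargs are the same as in the script above):

```python
from stable_baselines3.common.utils import set_random_seed

SEED = 42  # arbitrary choice; reuse the same value across runs to compare

# Seed Python/NumPy/PyTorch globally, then the envs and the model.
set_random_seed(SEED)
vec_env = make_vec_env("ALE/Seaquest-ram-v5", n_envs=16, seed=SEED)
model = A2C("MlpPolicy", vec_env, seed=SEED)  # plus the other kwargs from above
```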


araffin commented 5 months ago

Probably a duplicate of https://github.com/DLR-RM/rl-baselines3-zoo/issues/314 and https://github.com/DLR-RM/rl-baselines3-zoo/issues/204, among others (see the links in those issues).