DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

DQN example fails #526

Closed MihaiAnca13 closed 3 years ago

MihaiAnca13 commented 3 years ago

📚 Documentation

A clear and concise description of what should be improved in the documentation:

I accessed the documentation here and looked at section 1.23 (DQN). I tried the given example to solve CartPole-v0, but the model doesn't learn to solve it. Are the parameters set wrong? I tried A2C on the same environment and it solves it with no issues. DQN also fails for MountainCar and Acrobot. Thanks!

Here's the code used for completeness:

import gym
from stable_baselines3 import DQN

env = gym.make("CartPole-v0")
# Train with the default hyperparameters, exactly as in the docs example
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000, log_interval=4)

# Run the trained agent
obs = env.reset()
while True:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()


Miffyli commented 3 years ago

Good catch! This is expected: total_timesteps is much lower than learning_starts, so no actual learning is happening. You raise a good point, though: the example should show a successful application of DQN. To fix this, one should lower the learning_starts and exploration_fraction parameters until learning works. Complete tuned parameters can be found here.
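Lowering those parameters could look roughly like the sketch below; the concrete values are illustrative assumptions, not the tuned RL Zoo hyperparameters:

import gym
from stable_baselines3 import DQN

env = gym.make("CartPole-v0")
# Illustrative values only: learning_starts is dropped well below the training
# budget so gradient updates actually happen, and exploration_fraction is
# reduced so epsilon finishes decaying early in the 10k-step run.
model = DQN(
    "MlpPolicy",
    env,
    verbose=1,
    learning_starts=1000,
    exploration_fraction=0.05,
)
model.learn(total_timesteps=10000, log_interval=4)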

A fix for this would be welcome :). We maintainers are currently on holiday, so updates will be slow for the moment.

araffin commented 3 years ago

As mentioned in https://github.com/DLR-RM/stable-baselines3/issues/327#issuecomment-890510751 "Examples are here only to show how to quickly run something. Optimized hyperparameters and proper training are done in the RL Zoo."

Maybe we should add such a warning somewhere?

MihaiAnca13 commented 3 years ago

Hi again! Thanks for your help!

I was planning on editing the docs with your comments and the simplest configuration that still does the job. Unfortunately, the simplest I could find looks like this:

model = DQN('MlpPolicy', env, verbose=1, learning_starts=1000,
            target_update_interval=10, train_freq=256, gradient_steps=128,
            policy_kwargs={'net_arch': [256, 256]})
model.learn(total_timesteps=15000)
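For what it's worth, a quick way to check that a configuration "does the job" is to measure the mean episode reward with evaluate_policy; a minimal sketch (the 195 threshold is Gym's standard solving criterion for CartPole-v0, not something from this thread):

from stable_baselines3.common.evaluation import evaluate_policy

# CartPole-v0 counts as solved at a mean reward of 195 over 100 consecutive episodes
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")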

Should I just add a warning with a link to the right configuration instead? I guess the whole idea of that code snippet is to show how simple it is to use, so you probably don't want many arguments modified.

Miffyli commented 3 years ago

> Should I just add a warning with a link to the right configuration instead? I guess the whole idea of that code snippet is to show how simple it is to use, so you probably don't want many arguments modified.

Yup, this would be preferred. The example is indeed meant to show the simplicity. Include a warning that it is probably not enough to learn the task, and give a pointer to rl-zoo for those interested in good parameters. Thanks for working on this! :)