DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

DQN is not converging even after 15M timesteps #214

Closed MilanVZinzuvadiya closed 3 years ago

MilanVZinzuvadiya commented 3 years ago

Question

I am training Pong-v4 / PongNoFrameskip-v4 with DQN. It gives me a reward of around -20 to -21 even after 1.5e7 timesteps. I have tried various parameters for DQN, but it still gives the same result. I could not find proper hyperparameters for DQN, so I think there is a problem with DQN.

Additional context

At the beginning of training, the agent starts at around -20.4 to -20.2. After 3e6 timesteps it reaches -21 and then fluctuates in a range between -20.8 and -21.

I tried several variants of DQN, experimenting with different combinations of the following: learning_starts in [default 50k, 5k, 100k], gamma in [0.98, 0.99, 0.999], exploration_final_eps in [0.02, 0.05], learning_rate in [1e-3, 1e-4, 5e-4], and buffer_size in [50k, 500k, 1000k].

These combinations were applied to the code below.

from stable_baselines3 import DQN

model = DQN('CnnPolicy', env, verbose=1, learning_starts=50000, gamma=0.98, exploration_final_eps=0.02, learning_rate=1e-3)
model.learn(total_timesteps=int(1.5e7), log_interval=10)

Since I have already tried the combinations mentioned above, I am inclined to think there is a bug in the DQN implementation.

Miffyli commented 3 years ago

Have you tried using the parameters and/or other code from the zoo repository? I used parameters from the SB2 zoo (without prioritization/dueling/etc.) recently when matching the performance, and things worked out as expected (see #110).

MilanVZinzuvadiya commented 3 years ago

Thanks, @Miffyli! I have been stuck on this issue for more than 2 weeks of my research work. I didn't use exactly the same combination, and I also didn't use the frame stacking mentioned in those parameters.
I have now started training with the exact combination from those parameters and will give you an update after the training.
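For reference, the Atari preprocessing plus frame stacking with SB3's utilities looks roughly like the sketch below (the seed and n_envs values are placeholders, not the exact zoo configuration):

from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# AtariWrapper preprocessing (frame skip, grayscale, 84x84 resize) plus a 4-frame stack
env = make_atari_env('PongNoFrameskip-v4', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)

model = DQN('CnnPolicy', env, verbose=1)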

araffin commented 3 years ago

Yes, as mentioned in the documentation, please use the RL Zoo if you want to replicate results. It is only one line:

python train.py --algo dqn --env PongNoFrameskip-v4 --eval-episodes 10 --eval-freq 50000

Note: the reward in evaluation is the clipped one for now (see #181).

EDIT: I'm currently doing a run myself to check, and it gives me: Eval num_timesteps=750000, episode_reward=-15.20 +/- 2.36 (so it is already looking good even before 1M steps).

araffin commented 3 years ago

Closing this, as I'm getting Eval num_timesteps=1850000, episode_reward=20.40 +/- 0.66 (an almost perfect score after ~2M steps) with the RL Zoo, see the command in the previous comment (using SB3 v0.10.0).

chongyi-zheng commented 3 years ago

I got the same issue here after training for 10M steps on Pong, so is there anything wrong with the benchmark hyperparameters, or does the performance depend on the PyTorch version?

Here is my command:

--algo dqn --env PongNoFrameskip-v4

and the config log:

========== PongNoFrameskip-v4 ==========
Seed: 3242554354
OrderedDict([('batch_size', 32),
             ('buffer_size', 10000),
             ('env_wrapper',
              ['stable_baselines3.common.atari_wrappers.AtariWrapper']),
             ('exploration_final_eps', 0.01),
             ('exploration_fraction', 0.1),
             ('frame_stack', 4),
             ('gradient_steps', 1),
             ('learning_rate', 0.0001),
             ('learning_starts', 100000),
             ('n_timesteps', 10000000.0),
             ('optimize_memory_usage', True),
             ('policy', 'CnnPolicy'),
             ('target_update_interval', 1000),
             ('train_freq', 4)])

I will try the command above today and report back.
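For reference, the config above maps roughly onto the DQN constructor when building the agent outside the zoo (a minimal sketch; env is assumed to be the AtariWrapper-wrapped, frame-stacked environment):

from stable_baselines3 import DQN

model = DQN(
    'CnnPolicy',
    env,
    buffer_size=10_000,
    learning_rate=1e-4,
    batch_size=32,
    learning_starts=100_000,
    train_freq=4,
    gradient_steps=1,
    target_update_interval=1_000,
    exploration_fraction=0.1,
    exploration_final_eps=0.01,
    optimize_memory_usage=True,  # saves RAM by not storing next observations separately
    verbose=1,
)
model.learn(total_timesteps=int(1e7))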

araffin commented 3 years ago

> I got the same issue here after training for 10M steps on Pong, so is there anything wrong with the benchmark hyperparameters, or does the performance depend on the PyTorch version?

Make sure you have the latest SB3, RL Zoo, gym, and PyTorch versions.

For the v1.0 release, I trained DQN on many environments and below is the learning curve for Pong (because of frameskip, the displayed number of timesteps must be divided by 4):

[Figure: training episodic reward of DQN on Pong]

You can find the pre-trained agent and associated hyperparameters in the rl-trained-agents folder. To plot the training curve:

python scripts/plot_train.py -a dqn -e Pong -f rl-trained-agents/

chongyi-zheng commented 3 years ago

I have rerun my experiments with different seeds and see a weird result. Currently, the code seems to be seed dependent: I get poor performance with seed = 1738194436 and promising performance with seed = 3242554354. Would you mind doing a run to confirm this?
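To separate seed variance from an implementation problem, one option is to repeat the same setup under several seeds (a minimal sketch, assuming env is built with the Atari wrappers as above; the list of seeds is just an example):

from stable_baselines3 import DQN
from stable_baselines3.common.utils import set_random_seed

for seed in [1738194436, 3242554354, 0]:
    set_random_seed(seed)
    # passing seed to the model also seeds the env and the network initialization
    model = DQN('CnnPolicy', env, seed=seed, verbose=0)
    model.learn(total_timesteps=int(1e7))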

araffin commented 3 years ago

> I have rerun my experiments with different seeds and see a weird result.

See the docs on "tips and tricks" and "reproducibility":

One thing you can do is increase the replay buffer size to 1e5 or 1e6 (if it fits in your RAM). (I think I may have forgotten to set it back to a higher value, even though it seems to work in most cases, cf. the benchmark.)
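Building the agent directly, the larger buffer is just the buffer_size argument (a minimal sketch, assuming env is the wrapped Atari env from before; optimize_memory_usage roughly halves the replay buffer's RAM footprint):

from stable_baselines3 import DQN

model = DQN(
    'CnnPolicy',
    env,
    buffer_size=1_000_000,       # 1e6 transitions, as suggested above
    optimize_memory_usage=True,  # store observations once instead of (obs, next_obs) pairs
    verbose=1,
)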