DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

Why is my SB3 DQN agent unable to learn CartPole-v1 despite using optimal hyperparameters from RLZoo3? #472

Open Deepakgthomas opened 4 days ago

Deepakgthomas commented 4 days ago

📚 Documentation

I obtained the tuned hyperparameters for training CartPole-v1 from RLZoo3 and created a minimal example demonstrating the performance of my CartPole agent. As per the official docs, the agent should reach a score of 500 (the maximum episode return for CartPole-v1) for a successful episode. Unfortunately, the score doesn't rise above 300.
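(For reference, the hyperparameters below are meant to match the CartPole-v1 entry in the zoo's hyperparams/dqn.yml; a quick way to inspect that entry, assuming a local checkout of rl-baselines3-zoo at the hypothetical path used here, is:)

import yaml

# Hypothetical path to a local checkout of rl-baselines3-zoo;
# the tuned DQN settings are stored per environment in hyperparams/dqn.yml.
with open("rl-baselines3-zoo/hyperparams/dqn.yml") as f:
    zoo_hyperparams = yaml.safe_load(f)

print(zoo_hyperparams["CartPole-v1"])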

Here is my code -

import gymnasium as gym
import numpy as np
import torch
from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import BaseCallback
from torch.utils.tensorboard import SummaryWriter
import os

def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

class TensorBoardCallback(BaseCallback):
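    """Log the per-episode reward and a 100-episode running average to TensorBoard."""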
    def __init__(self, log_dir):
        super().__init__()
        self.writer = SummaryWriter(log_dir=log_dir)
        self.episode_rewards = []
        self.current_episode_reward = 0

    def _on_step(self):
        self.current_episode_reward += self.locals['rewards'][0]

        if self.locals['dones'][0]:
            self.episode_rewards.append(self.current_episode_reward)
            self.writer.add_scalar('train/episode_reward', self.current_episode_reward, self.num_timesteps)
            self.current_episode_reward = 0

            if len(self.episode_rewards) >= 100:
                avg_reward = sum(self.episode_rewards[-100:]) / 100
                self.writer.add_scalar('train/average_reward', avg_reward, self.num_timesteps)

        return True

    def _on_training_end(self):
        # SB3 expects subclasses to override the underscored hook;
        # BaseCallback.on_training_end() calls this at the end of training.
        self.writer.close()

# Set up logging directory
log_dir = "tensorboard_logs"
os.makedirs(log_dir, exist_ok=True)

# Set seed for reproducibility
seed = 42
set_seed(seed)

# Create environment
env = gym.make("CartPole-v1")
env = DummyVecEnv([lambda: env])

# Create model with hyperparameters from rlzoo3
model = DQN(
    policy="MlpPolicy",
    env=env,
    learning_rate=2.3e-3,
    batch_size=64,
    buffer_size=100000,
    learning_starts=1000,
    gamma=0.99,
    target_update_interval=10,
    train_freq=256,
    gradient_steps=128,
    exploration_fraction=0.16,
    exploration_final_eps=0.04,
    policy_kwargs=dict(net_arch=[256, 256]),
    verbose=1,
    tensorboard_log=log_dir,
    seed=seed
)

# Create callback
tb_callback = TensorBoardCallback(log_dir)

# Train the model
total_timesteps = 50000
model.learn(total_timesteps=total_timesteps, callback=tb_callback)

print("Training completed. You can view the results using TensorBoard.")
print(f"Run the following command in your terminal: tensorboard --logdir {log_dir}")

env.close()

Here is the final result -

[Screenshot: TensorBoard plot of the training episode reward]

Perhaps I am using RLZoo3 wrong? In any case, I would truly appreciate any help with this.


Deepakgthomas commented 4 days ago

I also put up a SO post about it here - https://stackoverflow.com/questions/79083972/why-is-my-sb3-dqn-agent-unable-to-learn-cartpole-v1-despite-using-optimal-hyperp

araffin commented 3 days ago

Unfortunately, the score doesn't rise above 300.

Are you talking about the training reward (average over many episodes) or about the final performance using the (quasi)-deterministic policy?
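For example, the deterministic performance can be checked with evaluate_policy (a minimal sketch, reusing the model and env from your script; n_eval_episodes=20 is arbitrary):

from stable_baselines3.common.evaluation import evaluate_policy

# Greedy (deterministic) policy evaluation; the score of ~500 for CartPole-v1
# usually refers to this kind of evaluation, not the exploratory training reward.
mean_reward, std_reward = evaluate_policy(
    model, model.get_env(), n_eval_episodes=20, deterministic=True
)
print(f"mean_reward={mean_reward:.1f} +/- {std_reward:.1f}")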

How many runs did you do?

Did you try using the RL Zoo: python -m rl_zoo3.train --algo dqn --env CartPole-v1 --eval-freq 10000 -P

A simple solution is to also increase the training budget.
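For instance (a rough sketch continuing from your script; the extra 100000 steps are just an illustrative budget):

# Keep training the same model; reset_num_timesteps=False keeps the
# internal step counter and TensorBoard logging continuous.
model.learn(
    total_timesteps=100_000,
    callback=TensorBoardCallback(log_dir),
    reset_num_timesteps=False,
)

# Re-check the deterministic evaluation score afterwards.
from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(
    model, model.get_env(), n_eval_episodes=20, deterministic=True
)
print(f"after more training: {mean_reward:.1f} +/- {std_reward:.1f}")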