DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: Possible inconsistencies with the PPO implementation #1986

Open rajfly opened 1 month ago

rajfly commented 1 month ago

🐛 Bug

I tested different implementations of the PPO algorithm and found discrepancies among them. Each implementation was run on 56 Atari environments, with five trials per (implementation, environment) pair. The table below shows an environment-wise one-way ANOVA measuring the effect of the implementation source on mean reward. Out of the 56 environments tested, the implementations (Stable Baselines3, CleanRL, and Baselines, i.e., not the Baselines108 variant) differed significantly in nine environments.

[Screenshot, 2024-08-02: table of environment-wise one-way ANOVA results]
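
For context, the per-environment test is just a one-way ANOVA over the five trial mean rewards from each implementation. A minimal sketch with scipy; the reward values below are placeholders for illustration, not the actual results:

from scipy.stats import f_oneway

# Five trial mean rewards per implementation for a single environment
# (hypothetical numbers, for illustration only)
rewards = {
    "stable_baselines3": [21000, 20500, 21800, 19900, 20700],
    "cleanrl": [20800, 21200, 20100, 21500, 20900],
    "baselines": [18500, 19000, 18200, 19400, 18800],
}

f_stat, p_value = f_oneway(*rewards.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# p < 0.05 flags this environment as one where the implementation source
# has a statistically significant effect on mean reward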

I believe there are inconsistencies among the implementations which cause the observed environment-dependent discrepancies. For example, I found a bug in Baselines' implementation where the frames per episode did not conform to the 108K limit of the v4 ALE specification, causing mean rewards to differ significantly in some environments. After correcting this, three of the nine environments previously flagged as statistically different were no longer different, as seen in the table above under Baselines108. The remaining inconsistencies are likely environment-related, so I am now investigating parts of Stable Baselines3's implementation that might affect a subset of environments (similar to the frames-per-episode issue). I was wondering whether there are any specific differences in the Stable Baselines3 implementation which might have contributed to the differences in performance? Any suggestions would be greatly appreciated :)
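
For reference, this is the kind of check that exposed the Baselines issue. A minimal sketch using gymnasium directly (Pong is just an example id; it assumes ale-py is installed so the NoFrameskip-v4 ids are registered, and that the usual frame skip of 4 is applied on top, so 108,000 emulator frames correspond to 27,000 agent steps):

import gymnasium as gym

# Inspect the episode step limit registered for the raw NoFrameskip-v4 env
spec = gym.spec("PongNoFrameskip-v4")
print("registered max_episode_steps:", spec.max_episode_steps)

# Override the limit explicitly so it matches the ALE spec of 108,000 emulator
# frames. The raw NoFrameskip env advances one emulator frame per step, so the
# cap is expressed in frames here; with a frame skip of 4 applied on top, this
# corresponds to at most 27,000 agent steps per episode.
env = gym.make("PongNoFrameskip-v4", max_episode_steps=108_000)

If make_atari_env forwards env_kwargs to gym.make (which I believe it does via make_vec_env), the same limit can instead be passed there as env_kwargs={"max_episode_steps": 108_000}.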

To Reproduce

Run Command:

python ppo_atari.py --gpu 0 --env Atlantis --trials 5

The hyperparameters follow those of the original PPO implementation (without LSTM). ppo_atari.py:

import argparse
import json
import os
import pathlib
import time
import uuid

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.logger import configure
from stable_baselines3.common.torch_layers import NatureCNN
from stable_baselines3.common.utils import get_linear_fn
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def train_atari(args):
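    # Standard Atari preprocessing across 8 parallel envs: no-op resets,
    # frame skip of 4, 84x84 grayscale frames, episodic life, reward clipping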
    env = make_atari_env(
        f"{args.env}NoFrameskip-v4",
        n_envs=8,
        seed=args.seed,
        wrapper_kwargs={
            "noop_max": 30,
            "frame_skip": 4,
            "screen_size": 84,
            "terminal_on_life_loss": True,
            "clip_reward": True,
            "action_repeat_probability": 0.0,
        },
        vec_env_cls=DummyVecEnv,
    )
    env = VecFrameStack(env, n_stack=4)

    model = PPO(
        "CnnPolicy",
        env,
        learning_rate=get_linear_fn(2.5e-4, 0, 1.0),
        n_steps=128,
        batch_size=256,
        n_epochs=4,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.1,
        clip_range_vf=0.1,
        normalize_advantage=True,
        ent_coef=0.01,
        vf_coef=0.5,
        max_grad_norm=float("inf") if args.noclip else 0.5,
        use_sde=False,
        target_kl=None,
        stats_window_size=100,
        policy_kwargs={
            "ortho_init": True,
            "features_extractor_class": NatureCNN,
            "share_features_extractor": True,
            "normalize_images": True,
        },
        seed=args.seed,
    )

    logger = configure(args.path, ["csv"])
    model.set_logger(logger)
    start_time = time.time()
    model.learn(total_timesteps=10000000, log_interval=1, progress_bar=True)
    train_end_time = time.time()
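    # Evaluation reuses the training env, so reward clipping and episodic
    # life are still active, and actions are sampled stochastically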
    mean_reward, _ = evaluate_policy(
        model,
        model.get_env(),
        n_eval_episodes=100,
        deterministic=False,
    )
    eval_end_time = time.time()
    args.training_time_h = ((train_end_time - start_time) / 60) / 60
    args.total_time_h = ((eval_end_time - start_time) / 60) / 60
    args.eval_mean_reward = mean_reward

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-g",
        "--gpu",
        type=int,
        help="Specify GPU index",
        default=0,
    )
    parser.add_argument(
        "-e",
        "--env",
        type=str,
        help="Specify Atari environment w/o version",
        default="Pong",
    )
    parser.add_argument(
        "-t",
        "--trials",
        type=int,
        help="Specify number of trials",
        default=5,
    )
    parser.add_argument(
        "-nc",
        "--noclip",
        action="store_true",
        help="Only specify for no gradient clipping",
    )
    args = parser.parse_args()
    for _ in range(args.trials):
        args.id = uuid.uuid4().hex
        if args.noclip:
            args.path = os.path.join("trials", "ppo", f"{args.env}_NoClip", args.id)
        else:
            args.path = os.path.join("trials", "ppo", args.env, args.id)
        args.seed = int(time.time())

        # create dir
        pathlib.Path(args.path).mkdir(parents=True, exist_ok=True)

        # set gpu
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{args.gpu}"

        train_atari(args)

        # save trial info
        with open(os.path.join(args.path, "info.json"), "w") as f:
            json.dump(vars(args), f, indent=4)

Relevant log output / Error message

No response

araffin commented 1 month ago

Hello, there are two differences that I know of:

Other differences might come from using PyTorch vs TensorFlow (for instance, the Adam implementation might be slightly different; the same happened for A2C: https://github.com/DLR-RM/stable-baselines3/pull/110)
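
For anyone who wants to rule out the optimizer as a factor, the policy optimizer can be pinned (or swapped) explicitly through policy_kwargs. A rough sketch rather than a recommended configuration; as far as I know SB3 already defaults Adam's eps to 1e-5, which matches what Baselines' ppo2 uses, so spelling it out mainly matters when experimenting with a different optimizer class:

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Same Atari preprocessing as in the script above, shortened for illustration
env = make_atari_env("PongNoFrameskip-v4", n_envs=8, seed=0)
env = VecFrameStack(env, n_stack=4)

model = PPO(
    "CnnPolicy",
    env,
    policy_kwargs={
        # Pin the optimizer class and its epsilon instead of relying on
        # defaults, to keep the PyTorch side comparable to the TF-based run
        "optimizer_class": torch.optim.Adam,
        "optimizer_kwargs": {"eps": 1e-5},
    },
)
model.learn(total_timesteps=10_000)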

rajfly commented 1 month ago

@araffin Thank you for the information. I will investigate the mentioned inconsistencies further.