DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: Possible inconsistencies with the PPO implementation #1986

Open rajfly opened 1 month ago

rajfly commented 1 month ago

🐛 Bug

I tested different implementations of the PPO algorithm and found discrepancies among them. Each implementation was run on 56 Atari environments, with five trials per (implementation, environment) pair. The table below shows an environment-wise one-way ANOVA measuring the effect of the implementation source on mean reward. Out of the 56 environments tested, the implementations (Stable Baselines3, CleanRL, and Baselines, i.e., not the Baselines108 variant) differed significantly in nine environments.

[Screenshot, 2024-08-02: table of environment-wise one-way ANOVA results]
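
For context, the per-environment test is just a one-way ANOVA over the five trial mean rewards from each implementation. A minimal sketch with scipy; the reward values below are placeholders for illustration, not the actual results:

from scipy.stats import f_oneway

# Five trial mean rewards per implementation for a single environment
# (hypothetical numbers, for illustration only)
rewards = {
    "stable_baselines3": [21000, 20500, 21800, 19900, 20700],
    "cleanrl": [20800, 21200, 20100, 21500, 20900],
    "baselines": [18500, 19000, 18200, 19400, 18800],
}

f_stat, p_value = f_oneway(*rewards.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# p < 0.05 flags this environment as one where the implementation source
# has a statistically significant effect on mean reward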

I believe there are inconsistencies among the implementations which cause the observed environment-dependent discrepancies. For example, I found a bug in Baselines' implementation where the frames per episode did not conform to the 108K limit of the v4 ALE specification, causing mean rewards to differ significantly in some environments. After correcting this, three of the nine environments previously flagged as statistically different were no longer different, as seen in the table above under Baselines108. The remaining inconsistencies are likely environment-related, so I am now investigating parts of Stable Baselines3's implementation that might affect a subset of environments (similar to the frames-per-episode issue). I was wondering whether there are any specific differences in the Stable Baselines3 implementation which might have contributed to the differences in performance? Any suggestions would be greatly appreciated :)
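
For reference, this is the kind of check that exposed the Baselines issue. A minimal sketch using gymnasium directly (Pong is just an example id; it assumes ale-py is installed so the NoFrameskip-v4 ids are registered, and that the usual frame skip of 4 is applied on top, so 108,000 emulator frames correspond to 27,000 agent steps):

import gymnasium as gym

# Inspect the episode step limit registered for the raw NoFrameskip-v4 env
spec = gym.spec("PongNoFrameskip-v4")
print("registered max_episode_steps:", spec.max_episode_steps)

# Override the limit explicitly so it matches the ALE spec of 108,000 emulator
# frames. The raw NoFrameskip env advances one emulator frame per step, so the
# cap is expressed in frames here; with a frame skip of 4 applied on top, this
# corresponds to at most 27,000 agent steps per episode.
env = gym.make("PongNoFrameskip-v4", max_episode_steps=108_000)

If make_atari_env forwards env_kwargs to gym.make (which I believe it does via make_vec_env), the same limit can instead be passed there as env_kwargs={"max_episode_steps": 108_000}.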

To Reproduce

Run Command:

python ppo_atari.py --gpu 0 --env Atlantis --trials 5

The hyperparameters follow those of the original PPO implementation (without LSTM). ppo_atari.py:

import argparse
import json
import os
import pathlib
import time
import uuid

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.logger import configure
from stable_baselines3.common.torch_layers import NatureCNN
from stable_baselines3.common.utils import get_linear_fn
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def train_atari(args):
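    # Standard Atari preprocessing across 8 parallel envs: no-op resets,
    # frame skip of 4, 84x84 grayscale frames, episodic life, reward clipping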
    env = make_atari_env(
        f"{args.env}NoFrameskip-v4",
        n_envs=8,
        seed=args.seed,
        wrapper_kwargs={
            "noop_max": 30,
            "frame_skip": 4,
            "screen_size": 84,
            "terminal_on_life_loss": True,
            "clip_reward": True,
            "action_repeat_probability": 0.0,
        },
        vec_env_cls=DummyVecEnv,
    )
    env = VecFrameStack(env, n_stack=4)

    model = PPO(
        "CnnPolicy",
        env,
        learning_rate=get_linear_fn(2.5e-4, 0, 1.0),
        n_steps=128,
        batch_size=256,
        n_epochs=4,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.1,
        clip_range_vf=0.1,
        normalize_advantage=True,
        ent_coef=0.01,
        vf_coef=0.5,
        max_grad_norm=float("inf") if args.noclip else 0.5,
        use_sde=False,
        target_kl=None,
        stats_window_size=100,
        policy_kwargs={
            "ortho_init": True,
            "features_extractor_class": NatureCNN,
            "share_features_extractor": True,
            "normalize_images": True,
        },
        seed=args.seed,
    )

    logger = configure(args.path, ["csv"])
    model.set_logger(logger)
    start_time = time.time()
    model.learn(total_timesteps=10000000, log_interval=1, progress_bar=True)
    train_end_time = time.time()
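    # Evaluation reuses the training env, so reward clipping and episodic
    # life are still active, and actions are sampled stochastically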
    mean_reward, _ = evaluate_policy(
        model,
        model.get_env(),
        n_eval_episodes=100,
        deterministic=False,
    )
    eval_end_time = time.time()
    args.training_time_h = ((train_end_time - start_time) / 60) / 60
    args.total_time_h = ((eval_end_time - start_time) / 60) / 60
    args.eval_mean_reward = mean_reward

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-g",
        "--gpu",
        type=int,
        help="Specify GPU index",
        default=0,
    )
    parser.add_argument(
        "-e",
        "--env",
        type=str,
        help="Specify Atari environment w/o version",
        default="Pong",
    )
    parser.add_argument(
        "-t",
        "--trials",
        type=int,
        help="Specify number of trials",
        default=5,
    )
    parser.add_argument(
        "-nc",
        "--noclip",
        action="store_true",
        help="Only specify for no gradient clipping",
    )
    args = parser.parse_args()
    for _ in range(args.trials):
        args.id = uuid.uuid4().hex
        if args.noclip:
            args.path = os.path.join("trials", "ppo", f"{args.env}_NoClip", args.id)
        else:
            args.path = os.path.join("trials", "ppo", args.env, args.id)
        args.seed = int(time.time())

        # create dir
        pathlib.Path(args.path).mkdir(parents=True, exist_ok=True)

        # set gpu
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{args.gpu}"

        train_atari(args)

        # save trial info
        with open(os.path.join(args.path, "info.json"), "w") as f:
            json.dump(vars(args), f, indent=4)

Relevant log output / Error message

No response

araffin commented 1 month ago

Hello, there are two differences that I know of:

Other differences might come from using PyTorch vs TensorFlow (for instance, the Adam implementation might be slightly different; the same happened for A2C: https://github.com/DLR-RM/stable-baselines3/pull/110)
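
For anyone who wants to rule out the optimizer as a factor, the policy optimizer can be pinned (or swapped) explicitly through policy_kwargs. A rough sketch rather than a recommended configuration; as far as I know SB3 already defaults Adam's eps to 1e-5, which matches what Baselines' ppo2 uses, so spelling it out mainly matters when experimenting with a different optimizer class:

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Same Atari preprocessing as in the script above, shortened for illustration
env = make_atari_env("PongNoFrameskip-v4", n_envs=8, seed=0)
env = VecFrameStack(env, n_stack=4)

model = PPO(
    "CnnPolicy",
    env,
    policy_kwargs={
        # Pin the optimizer class and its epsilon instead of relying on
        # defaults, to keep the PyTorch side comparable to the TF-based run
        "optimizer_class": torch.optim.Adam,
        "optimizer_kwargs": {"eps": 1e-5},
    },
)
model.learn(total_timesteps=10_000)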

rajfly commented 1 month ago

@araffin Thank you for the information. I will investigate the mentioned inconsistencies further.