If you want the correct hyperparameters for Atari, you should use the RL Zoo.
The example in the docs is there to show the API; we kept it concise to focus on the wrappers we provide.
❓ Question
I copied the code from the Examples section in the documentation, which also uses a PongNoFrameskip-v4 environment with 4 stacked frames. The episodic mean reward starts out around -20 but then worsens, after which it fluctuates between -21 and -20.5. I use the default hyperparameters of the A2C CNN policy, as you can tell from the code below.
I'm running this code using Python 3.10.4 and torch 2.3.0. What could be going wrong here, and shouldn't this example code just work?
Checklist