If you want the correct hyperparameters for Atari, you should use the RL Zoo.
The example in the docs is there to show the API; we kept it concise to focus on the wrappers we provide.
❓ Question
I copied the code from the Examples section in the documentation, which also uses a PongNoFrameskip-v4 environment with 4 stacked frames. The episodic mean reward starts out around -20 but then worsens, after which it fluctuates between -21 and -20.5. I use the default hyperparameters of the A2C CNN policy, as you can tell from the code below.
I'm running this code using Python 3.10.4 and torch 2.3.0. What could be going wrong here, and shouldn't this example code just work?
Checklist