Training results are not the same after 100k steps

Hi, I just ran the sb3_highway_dqn.py script at head (20k steps), and here are the logs I get:

```
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 9.75     |
|    ep_rew_mean      | 6.78     |
|    exploration_rate | 0.981    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 13       |
|    time_elapsed     | 2        |
|    total_timesteps  | 39       |
----------------------------------
```

...

```
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 25.2     |
|    ep_rew_mean      | 19.7     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 1056     |
|    fps              | 12       |
|    time_elapsed     | 1566     |
|    total_timesteps  | 19894    |
| train/              |          |
|    learning_rate    | 0.0005   |
|    loss             | 0.126    |
|    n_updates        | 19693    |
----------------------------------
```

So the mean episode reward improved from 6.78 to 19.7, and the mean episode length from 9.75 to 25.2.

Here are 10 (non-cherry-picked) test episodes:

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/258e214e-8d2b-440a-b380-79aaa54db762

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/14ad0a05-c5ff-4d18-9f56-d894d77ec2f1

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/fb0119dc-13fb-403b-b4d1-6f0e9c7f87b1

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/35b7dff7-e03a-42c1-bc9e-91d196f9dc4b

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/4ac6063f-2ed4-4c08-be55-9aaa81d348de

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/19e838c0-4bc8-454c-bac0-090b7a305fd5

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/cc83377d-b223-4f00-a5f1-16b9f8d7c204

While they are not perfect by any means, I think they show some situational awareness: at least the vehicle doesn't just crash into the first vehicle on the highway like in your case, so I'm not sure what is going on. If you see similar metrics (reward, episode length) during training, maybe you are not loading the checkpoint correctly at test time?
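In case it helps to rule that out, here is a minimal sketch of loading a saved model and evaluating it with Stable-Baselines3. The checkpoint path `highway_dqn/model` and the environment id `highway-fast-v0` are assumptions on my side (adjust them to whatever your script actually saves and trains on), and `evaluate_checkpoint` is just a hypothetical helper name. It also assumes that importing `highway_env` registers the environments, which may depend on the installed version.

```python
from pathlib import Path


def evaluate_checkpoint(model_path="highway_dqn/model",
                        env_id="highway-fast-v0",
                        n_episodes=10):
    """Load a saved DQN checkpoint and return (mean_reward, std_reward).

    The default path and env id are assumptions, not values confirmed
    by the training script; change them to match your setup.
    """
    # Fail early if nothing was saved at this location. SB3 usually
    # writes "<path>.zip", so accept the path with or without it.
    path = Path(model_path)
    if not (path.exists() or Path(str(path) + ".zip").exists()):
        raise FileNotFoundError(f"no checkpoint found at {model_path}")

    # Heavy imports are kept local so the file check above runs first.
    import gymnasium as gym
    import highway_env  # noqa: F401  (assumed to register the envs on import)
    from stable_baselines3 import DQN
    from stable_baselines3.common.evaluation import evaluate_policy

    env = gym.make(env_id)
    # Load the trained weights instead of constructing a fresh DQN(...),
    # which would start from an untrained network.
    model = DQN.load(model_path, env=env)
    mean_reward, std_reward = evaluate_policy(
        model, env, n_eval_episodes=n_episodes, deterministic=True
    )
    env.close()
    return mean_reward, std_reward
```

If the mean reward returned here is far below the `ep_rew_mean` from the end of training, the problem is probably on the loading/testing side rather than in training itself.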

Edit: maybe there was a slight regression compared to the Getting Started version: all 5 runs there get roughly 25 mean reward, while my last training only reached 20, and the behaviours qualitatively look a bit more conservative than in the Getting Started video. It's probably worth running a few more experiments to check whether this regression is reproducible.
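One rough way to judge whether such a gap is more than run-to-run noise is to collect the final `ep_rew_mean` of several runs on each version and express the difference in units of the baseline spread. The sketch below uses only the standard library; the function names are hypothetical, and the numbers in the usage part are placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(final_rewards):
    """Return (mean, sample standard deviation) of per-run final rewards."""
    if len(final_rewards) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return mean(final_rewards), stdev(final_rewards)


def gap_in_baseline_stds(baseline_rewards, new_rewards):
    """How far the new mean sits below the baseline mean,
    measured in baseline standard deviations."""
    base_mean, base_std = summarize_runs(baseline_rewards)
    if base_std == 0:
        raise ValueError("baseline runs are identical; spread is zero")
    return (base_mean - mean(new_rewards)) / base_std


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT real training results.
    baseline = [24.0, 25.0, 26.0]
    new = [20.0, 21.0]
    print(gap_in_baseline_stds(baseline, new))
```

With only a handful of runs this is a sanity check rather than a proper statistical test, but a gap of several standard deviations that persists across seeds would suggest the regression is real.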

Training results are not the same after 100k steps #460