Closed — InigoMoreno closed this issue 2 months ago
Thanks for letting me know. Unfortunately I don't have the bandwidth to dive into the details and fix this; I've gotten some other comments about it in the Discord server as well. I'm planning to add a disclaimer that it's mainly a proof of concept and an adaptation of an old Medium article that people had asked about a lot. That article used on the order of a million training steps, whereas I only had a laptop and did ten to twenty thousand.
Those notes in the code were basically just my own notes from testing out hyperparameters, and I'm fairly sure I just got lucky, because it should take a lot longer to train a decent policy. With fixed random seeds we do have GitHub Actions CI tests that check some basic results, like that training stays below a threshold for one env, reaches greater than 50% for another env, etc. But it's hard to guarantee anything for other seeds, and a number of people have said they didn't find things working very well. If you or anyone else seeing this has ideas for improvements or bug fixes, feel free to suggest them. Maybe you could try training for longer with different hyperparameters to see if it can eventually converge on a good policy?
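For readers unfamiliar with the kind of CI check being described: none of the names below come from the repo's actual tests; this is just a minimal seeded-threshold sketch showing why the seed has to be pinned for such an assertion to be reliable.

```python
import random


def winrate_vs_random(skill, n_games=1000, seed=42):
    """Simulate rollouts where the agent wins each game with probability
    `skill`; a stand-in for evaluating a trained policy against a random
    opponent."""
    rng = random.Random(seed)  # pinned seed keeps the CI check deterministic
    wins = sum(rng.random() < skill for _ in range(n_games))
    return wins / n_games


# CI-style assertion: with this seed the threshold is reproducible;
# for other seeds there is no such guarantee.
assert winrate_vs_random(skill=0.7) > 0.5
```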
The unexpected winrates happen because after calling `env.step` on a PettingZoo `AECEnv`, `env.agent_selection` is set to the next agent, at least for `connect_four_v3`:

https://github.com/Farama-Foundation/PettingZoo/blob/849414dfcba7f423f4db02bfecc7ef061cccc567/pettingzoo/classic/connect_four/connect_four.py#L195-L205

So when you call `env.last` after a step, it returns the observation, reward, etc. from the point of view of the next agent, not the one that took the step. This causes `model.learn` to learn how to lose instead of how to win.
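To make that bookkeeping concrete, here is a toy stand-in — not the real PettingZoo `AECEnv` API, just the same `agent_selection`/`last()` semantics — where the mismatch is visible:

```python
class ToyAECEnv:
    """Minimal two-player, zero-sum stand-in mimicking AECEnv bookkeeping:
    step() advances agent_selection BEFORE last() is called."""

    def __init__(self):
        self.agents = ["player_0", "player_1"]
        self.agent_selection = "player_0"
        self.rewards = {"player_0": 0, "player_1": 0}

    def step(self, action):
        mover = self.agent_selection
        other = "player_1" if mover == "player_0" else "player_0"
        # Pretend the mover just won: +1 for the mover, -1 for the opponent.
        self.rewards = {mover: 1, other: -1}
        # Crucially, selection now points at the NEXT agent.
        self.agent_selection = other

    def last(self):
        # Reward is reported from the point of view of agent_selection,
        # i.e. the agent about to move, not the one that just moved.
        return None, self.rewards[self.agent_selection], True, False, {}


env = ToyAECEnv()
env.step(action=0)            # player_0 makes a winning move
_, reward, *_ = env.last()
print(reward)                 # -1: the loser's reward, not the winner's
```

Negating that reward, as in the patch below, recovers the +1 that the acting agent should be trained on.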
After patching `SB3ActionMaskWrapper.step` to return the negated reward, the model produces results that are in line with the "Evaluation/training hyperparameter notes" from the comments:
```python
def step(self, action):
    """Gymnasium-like step function, returning observation, reward, termination, truncation, info."""
    super().step(action)
    obs, reward, termination, truncation, info = super().last()
    # super().last() reports from the next agent's perspective,
    # so flip the sign to get the acting agent's reward.
    return obs, -reward, termination, truncation, info
```
Before:

* 10k steps: Winrate: 0.46
* 20k steps: Winrate: 0.35
* 40k steps: Winrate: 0.21

After:

* 10k steps: Winrate: 0.77
* 20k steps: Winrate: 0.79
* 40k steps: Winrate: 0.84
There has probably been some API change inside PettingZoo at some point that caused this tutorial to break. Other tutorials that depend on `AECEnv`, such as the AgileRL tutorial, are also affected.
Wow, this helped solve my issues. Such a weird bug; thanks for helping track it down. We should probably open a PR with those changes. @MischaU8, do you want to be the one to do it?
Great to hear that this solved the problem for you as well!
While the reward change fixes some things, I don't understand why it is needed at all. Assuming this tutorial worked correctly at some point, there must have been a later API change in gym/gymnasium/PettingZoo/connect four that broke it. Since I don't understand the history of this problem, I don't feel comfortable creating a PR to address it.
Describe the bug
I am trying to run the connect four tutorial with SB3 (https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/SB3/connect_four/sb3_connect_four_action_mask.py). In the tutorial, the following winrates are reported:
But I get the following:
As you can see, the agent is not learning; it is even losing to the random agent.
Code example
https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/SB3/connect_four/sb3_connect_four_action_mask.py
System info
Additional context
No response