Closed — InigoMoreno closed this issue 2 months ago
Thanks for letting me know. Unfortunately I don't have the bandwidth to dive into the details and fix this; I've gotten some other comments about it in the Discord server as well. I'm planning to add a disclaimer that it's mainly a proof of concept and an adaptation of an old Medium article that people had asked about a lot. That article used on the order of a million training steps, whereas I only had a laptop and did ten to twenty thousand.
Those notes in the code were basically just my own notes from testing out hyperparameters, and I'm fairly sure I just got lucky, because it should take a lot longer to train a decent policy. With fixed random seeds we do have GitHub Actions CI tests that check some basic results, like that training stays below a threshold for one env, reaches greater than 50% for another env, etc. But it's hard to guarantee anything for other seeds, and a number of people have said they didn't find things working very well. If you or anyone else seeing this has ideas for improvements or bug fixes, feel free to suggest them. Maybe you could try training for longer with different hyperparameters to see if it can eventually converge on a good policy?
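For readers unfamiliar with the kind of CI check being described: none of the names below come from the repo's actual tests; this is just a minimal seeded-threshold sketch showing why the seed has to be pinned for such an assertion to be reliable.

```python
import random


def winrate_vs_random(skill, n_games=1000, seed=42):
    """Simulate rollouts where the agent wins each game with probability
    `skill`; a stand-in for evaluating a trained policy against a random
    opponent."""
    rng = random.Random(seed)  # pinned seed keeps the CI check deterministic
    wins = sum(rng.random() < skill for _ in range(n_games))
    return wins / n_games


# CI-style assertion: with this seed the threshold is reproducible;
# for other seeds there is no such guarantee.
assert winrate_vs_random(skill=0.7) > 0.5
```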
The unexpected winrates happen because after calling `env.step` on a PettingZoo `AECEnv`, `env.agent_selection` is set to the next agent, at least for `connect_four_v3`:

https://github.com/Farama-Foundation/PettingZoo/blob/849414dfcba7f423f4db02bfecc7ef061cccc567/pettingzoo/classic/connect_four/connect_four.py#L195-L205

So when you call `env.last` after a step, it returns the observation, reward, etc. from the point of view of the next agent, not the one that took the step. This causes `model.learn` to learn how to lose instead of how to win.
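To make that bookkeeping concrete, here is a toy stand-in — not the real PettingZoo `AECEnv` API, just the same `agent_selection`/`last()` semantics — where the mismatch is visible:

```python
class ToyAECEnv:
    """Minimal two-player, zero-sum stand-in mimicking AECEnv bookkeeping:
    step() advances agent_selection BEFORE last() is called."""

    def __init__(self):
        self.agents = ["player_0", "player_1"]
        self.agent_selection = "player_0"
        self.rewards = {"player_0": 0, "player_1": 0}

    def step(self, action):
        mover = self.agent_selection
        other = "player_1" if mover == "player_0" else "player_0"
        # Pretend the mover just won: +1 for the mover, -1 for the opponent.
        self.rewards = {mover: 1, other: -1}
        # Crucially, selection now points at the NEXT agent.
        self.agent_selection = other

    def last(self):
        # Reward is reported from the point of view of agent_selection,
        # i.e. the agent about to move, not the one that just moved.
        return None, self.rewards[self.agent_selection], True, False, {}


env = ToyAECEnv()
env.step(action=0)            # player_0 makes a winning move
_, reward, *_ = env.last()
print(reward)                 # -1: the loser's reward, not the winner's
```

Negating that reward, as in the patch below, recovers the +1 that the acting agent should be trained on.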
After patching `SB3ActionMaskWrapper.step` to return the negated reward, the model produces results that are in line with the "Evaluation/training hyperparameter notes" from the comments:
```python
def step(self, action):
    """Gymnasium-like step function, returning observation, reward, termination, truncation, info."""
    super().step(action)
    obs, reward, termination, truncation, info = super().last()
    # super().last() reports from the next agent's perspective,
    # so flip the sign to get the acting agent's reward.
    return obs, -reward, termination, truncation, info
```
Before:

* 10k steps: Winrate: 0.46
* 20k steps: Winrate: 0.35
* 40k steps: Winrate: 0.21

After:

* 10k steps: Winrate: 0.77
* 20k steps: Winrate: 0.79
* 40k steps: Winrate: 0.84
There has probably been some API change inside PettingZoo at some point that caused this tutorial to break. Other tutorials that depend on `AECEnv`, such as the AgileRL tutorial, are also affected.
Wow, this helped solve my issues. Such a weird bug; thanks for helping track it down. We should probably open a PR with those changes. @MischaU8, do you want to be the one to do it?
Great to hear that this solved the problem for you as well!
While the reward change fixes some things, I don't understand why it is needed at all. Assuming this tutorial worked correctly at some point, there must have been a later API change in gym/gymnasium/PettingZoo/connect four that broke it. Since I don't understand the history of this problem, I don't feel comfortable creating a PR to address it.
Describe the bug
I am trying to run the connect four tutorial with SB3 (https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/SB3/connect_four/sb3_connect_four_action_mask.py). In the tutorial, the following winrates are reported:
But I get the following:
As you can see, the agent is not learning; it is even losing to the random agent.
Code example
https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/SB3/connect_four/sb3_connect_four_action_mask.py
System info
Additional context
No response