Description

SB3ActionMaskWrapper.step() is intended to be compatible with Gymnasium's interface, where step() returns observation, reward, termination, truncation, info.

This was implemented using the last() function, but last() returns the values for the current agent (the one about to act next), not for the agent that just acted, as Gymnasium would.

Among other things, the policy is then trained on the opponent's reward, which encourages bad play.

The function now returns the reward, termination, truncation, info values for the agent that just acted. It still returns the observation for the next agent since it is used to determine the next action.
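To illustrate the difference, here is a minimal sketch using a toy stand-in rather than the real wrapper. ToyTurnEnv, step_buggy and step_fixed are hypothetical names made up for this example; the toy only mimics the agent_selection / last() / observe() shape of a turn-based environment and is not the code changed in this PR.

```python
class ToyTurnEnv:
    """Hypothetical two-player, turn-based stand-in (not a real PettingZoo env)."""

    def __init__(self):
        self.agents = ["player_0", "player_1"]
        self.agent_selection = "player_0"
        self.rewards = {a: 0.0 for a in self.agents}
        self.terminations = {a: False for a in self.agents}
        self.truncations = {a: False for a in self.agents}
        self.infos = {a: {} for a in self.agents}

    def observe(self, agent):
        # Toy observation: just records whose point of view this is.
        return {"viewer": agent}

    def last(self):
        # Values for the agent that is about to act, not the one that just did.
        a = self.agent_selection
        return (self.observe(a), self.rewards[a], self.terminations[a],
                self.truncations[a], self.infos[a])

    def step(self, action):
        actor = self.agent_selection
        other = self.agents[1 - self.agents.index(actor)]
        if action == 1:  # assumption: action 1 is an immediately winning move
            self.rewards = {actor: 1.0, other: -1.0}
            self.terminations = {a: True for a in self.agents}
        self.agent_selection = other  # turn passes to the opponent


def step_buggy(env, action):
    # Old behaviour: after step(), last() reports the *next* agent's values.
    env.step(action)
    return env.last()


def step_fixed(env, action):
    # New behaviour: remember who acted, then report that agent's values.
    acting_agent = env.agent_selection
    env.step(action)
    next_agent = env.agent_selection
    return (
        env.observe(next_agent),  # observation is still for the next agent
        env.rewards[acting_agent],
        env.terminations[acting_agent],
        env.truncations[acting_agent],
        env.infos[acting_agent],
    )


buggy_reward = step_buggy(ToyTurnEnv(), 1)[1]   # -1.0: the opponent's reward
fixed_reward = step_fixed(ToyTurnEnv(), 1)[1]   # 1.0: the acting agent's reward
print(buggy_reward, fixed_reward)
```

In the toy, player_0 plays a winning move: the old pattern hands the learner the loser's reward of -1.0, while the new pattern hands it the 1.0 earned by the move it actually made.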

Fixes #1147

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist:

[x] I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
[x] I have run pytest -v and no errors are present.
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
[ ] I have added tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes

Farama-Foundation / PettingZoo

Fix bug in SB3 tutorial ActionMask #1203
