SB3ActionMaskWrapper.step() is intended to be compatible with Gymnasium's
interface, where step() returns observation, reward, termination, truncation, info.
This was implemented using the last() function, but last() returns the values for
the current agent (the one about to act), not for the agent that just acted, as
Gymnasium would.
Among other things, this trains on the opponent's reward, encouraging bad play.
The function now returns the reward, termination, truncation, and info values for the
agent that just acted. It still returns the observation for the next agent, since
it is used to determine the next action.
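
To illustrate the difference, here is a minimal sketch of the idea. It is not the actual diff. `ToyAECEnv` and `step_for_acting_agent` are hypothetical names made up for this example. The toy class only mimics the parts of the AEC interface involved here (`agent_selection`, `last()`, `observe()`, and the per-agent `rewards`, `terminations`, `truncations`, and `infos` dicts) so the snippet runs without PettingZoo installed.

```python
class ToyAECEnv:
    """Hypothetical two-player, turn-based stand-in for an AEC environment."""

    def __init__(self):
        self.agents = ["player_0", "player_1"]
        self.agent_selection = "player_0"
        self.rewards = {a: 0.0 for a in self.agents}
        self.terminations = {a: False for a in self.agents}
        self.truncations = {a: False for a in self.agents}
        self.infos = {a: {} for a in self.agents}

    def observe(self, agent):
        return {"observation": f"board seen by {agent}", "action_mask": [1, 1]}

    def step(self, action):
        actor = self.agent_selection
        other = next(a for a in self.agents if a != actor)
        if action == 1:
            # Toy rule: action 1 wins the game for whoever plays it.
            self.rewards = {actor: 1.0, other: -1.0}
            self.terminations = {a: True for a in self.agents}
        else:
            self.rewards = {a: 0.0 for a in self.agents}
        # Control passes to the other player after every move.
        self.agent_selection = other

    def last(self):
        # Reports values for the agent that is about to act.
        agent = self.agent_selection
        return (
            self.observe(agent),
            self.rewards[agent],
            self.terminations[agent],
            self.truncations[agent],
            self.infos[agent],
        )


def step_for_acting_agent(env, action):
    """Gymnasium-style step: reward and flags belong to the agent that acted."""
    acting_agent = env.agent_selection  # remember who moves before stepping
    env.step(action)
    next_agent = env.agent_selection
    return (
        env.observe(next_agent)["observation"],  # next agent's view picks the next action
        env.rewards[acting_agent],
        env.terminations[acting_agent],
        env.truncations[acting_agent],
        env.infos[acting_agent],
    )
```

In this toy, if player_0 plays the winning move, reading the reward through `last()` afterwards gives player_1's loss, while `step_for_acting_agent` gives player_0's win, which is the value a single-agent learner should be trained on.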
Fixes #1147
Type of change
Bug fix (non-breaking change which fixes an issue)
Checklist:
[x] I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
[x] I have run pytest -v and no errors are present.
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
[ ] I have added tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes