Farama-Foundation / PettingZoo

An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities
https://pettingzoo.farama.org
Other
2.45k stars 400 forks source link

[Bug Report] TerminateIllegalWrapper breaks agent selection #1176

Closed dm-ackerman closed 1 week ago

dm-ackerman commented 4 months ago

Describe the bug

When running an env using TerminateIllegalWrapper and not using the action mask, the agent selection becomes corrupted when an illegal move is made.

Here is an example from tictactoe (code below). Notice in the first game that player 1 starts (as expected) and it alternates between players 1 and 2 (as expected) until player 1 makes an illegal move, which is caught by the wrapper. However, in the second game, player 1 makes two moves in a row. That should not happen. Also note that the illegal move flagged is not actually illegal per the game rules. This behaviour has been reported for other games.

New game
--------
calling reset(seed=42)
player_1 is making action: 1 current board:[0, 0, 0, 0, 0, 0, 0, 0, 0]
player_2 is making action: 5 current board:[0, 1, 0, 0, 0, 0, 0, 0, 0]
player_1 is making action: 7 current board:[0, 1, 0, 0, 0, 2, 0, 0, 0]
player_2 is making action: 4 current board:[0, 1, 0, 0, 0, 2, 0, 1, 0]
player_1 is making action: 1 current board:[0, 1, 0, 0, 2, 2, 0, 1, 0]
[WARNING]: Illegal move made, game terminating with current player losing. 
obs['action_mask'] contains a mask of all legal moves that can be chosen.

New game
--------
calling reset(seed=42)
player_1 is making action: 5 current board:[0, 0, 0, 0, 0, 0, 0, 0, 0]
player_1 is making action: 0 current board:[0, 0, 0, 0, 0, 1, 0, 0, 0]
[WARNING]: Illegal move made, game terminating with current player losing. 
obs['action_mask'] contains a mask of all legal moves that can be chosen.

Code example

from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()

def do_game(seed):
    print("\nNew game")
    print("--------")

    print(f"calling reset(seed={seed})")
    env.reset(seed)
    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()

        if termination or truncation:
            env.step(None)
        else:
            mask = observation["action_mask"]
            # this is where you would insert your policy
            action = env.action_space(agent).sample()  # **no action_mask applied**
            print(f"{env.agent_selection} is making action: {action} current board:{env.board}")
            env.step(action)

do_game(42)
do_game(42)

System info

>>> import sys; sys.version
'3.9.12 (main, Apr  5 2022, 06:56:58) \n[GCC 7.5.0]'
>>> pettingzoo.__version__
'1.24.3'

Additional context

No response

Checklist

dm-ackerman commented 4 months ago

The problem here is that env variables, including agent_selection, are set by calls from TerminateIllegalWrapper to env functions. However, they are called by the wrapper object, not the env so they are set in the wrapper object rather than the base env object. When the code later tries to run, the values get updated in the env code, but the wrapper pulls it's own values that shadow them.

There are several ways to fix this. I think the most reliable and robust fix is to ensure that base wrapper has all methods from AECEnv and redirects them to be called by the env.

An alternate option is to change TerminateIllegalWrapper to call the method from the unwrapped env, but that is less general because it relies on other wrappers to do that. Alternatively, agent_selection and other values can be set as properties in BaseWrapper, but that requires more items added to BaseWrapper.

dm-ackerman commented 4 months ago

(for later reference when testing the fix) Code that shows the problem:

from pettingzoo.classic import tictactoe_v3
env = tictactoe_v3.env()

def show_agent_selector(env):
    unwrapped = env
    while type(unwrapped) != type(env.unwrapped):
        # the actual value for this wrapper (or a placeholder if no value)
        agent_here = unwrapped.__dict__.get("agent_selection", "--------")
        print(f"agent_selection is {agent_here} in wrapper: {unwrapped.__class__}")
        unwrapped = unwrapped.env

    agent_here = env.unwrapped.__dict__.get("agent_selection", "------")
    print(f"agent_selection is {agent_here} in env    : {unwrapped.__class__}")

def do_game(seed):
    print("\nNew game\n")

    print(f"calling reset(seed={seed})")
    env.reset(seed)
    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()

        if termination or truncation:
            env.step(None)
        else:
            mask = observation["action_mask"]
            # this is where you would insert your policy
            action = env.action_space(agent).sample()
            print(f"{env.agent_selection} is making action: {action} current board:{env.board}")
            env.step(action)

show_agent_selector(env)
do_game(42)
show_agent_selector(env)
do_game(42)
show_agent_selector(env)

Output Before the game only raw_env had agent_selection defined. After the illegal move, the illegal wrapper had it defined as well which didn't get reset when resetting. A fix will prevent the wrapper from defining its own agent_selection (and any other variables that should be in raw_env)

agent_selection is -------- in wrapper: <class 'pettingzoo.utils.wrappers.order_enforcing.OrderEnforcingWrapper'> agent_selection is -------- in wrapper: <class 'pettingzoo.utils.wrappers.assert_out_of_bounds.AssertOutOfBoundsWrapper'> agent_selection is -------- in wrapper: <class 'pettingzoo.utils.wrappers.terminate_illegal.TerminateIllegalWrapper'> agent_selection is player_1 in env : <class 'pettingzoo.classic.tictactoe.tictactoe.raw_env'>

New game

calling reset(seed=42) player_1 is making action: 3 current board:[0, 0, 0, 0, 0, 0, 0, 0, 0] player_2 is making action: 6 current board:[0, 0, 0, 1, 0, 0, 0, 0, 0] player_1 is making action: 7 current board:[0, 0, 0, 1, 0, 0, 2, 0, 0] player_2 is making action: 4 current board:[0, 0, 0, 1, 0, 0, 2, 1, 0] player_1 is making action: 8 current board:[0, 0, 0, 1, 2, 0, 2, 1, 0] player_2 is making action: 4 current board:[0, 0, 0, 1, 2, 0, 2, 1, 1] [WARNING]: Illegal move made, game terminating with current player losing. obs['action_mask'] contains a mask of all legal moves that can be chosen. agent_selection is -------- in wrapper: <class 'pettingzoo.utils.wrappers.order_enforcing.OrderEnforcingWrapper'> agent_selection is -------- in wrapper: <class 'pettingzoo.utils.wrappers.assert_out_of_bounds.AssertOutOfBoundsWrapper'> agent_selection is player_2 in wrapper: <class 'pettingzoo.utils.wrappers.terminate_illegal.TerminateIllegalWrapper'> agent_selection is player_2 in env : <class 'pettingzoo.classic.tictactoe.tictactoe.raw_env'>

New game

calling reset(seed=42) player_2 is making action: 0 current board:[0, 0, 0, 0, 0, 0, 0, 0, 0] [WARNING]: Illegal move made, game terminating with current player losing. obs['action_mask'] contains a mask of all legal moves that can be chosen. agent_selection is -------- in wrapper: <class 'pettingzoo.utils.wrappers.order_enforcing.OrderEnforcingWrapper'> agent_selection is -------- in wrapper: <class 'pettingzoo.utils.wrappers.assert_out_of_bounds.AssertOutOfBoundsWrapper'> agent_selection is player_2 in wrapper: <class 'pettingzoo.utils.wrappers.terminate_illegal.TerminateIllegalWrapper'> agent_selection is player_1 in env : <class 'pettingzoo.classic.tictactoe.tictactoe.raw_env'>

elliottower commented 3 months ago

I think there should be a good way to fix this, will discuss in the other PR or make a new one

elliottower commented 3 months ago

Thanks for the detailed code we can adapt this as a test case to ensure it doesn’t break in the future