[Bug Report] [rllib] RLlib tutorial works with leduc_holdem but not with tictactoe?

bug description Using Ray[rllib] 1.13.0 and pettingzoo 1.19.0, I'm having difficulty training DQN on the TicTacToe env. The simplest reproduction I've found is rllib_leduc_holdem.py, which runs fine; however, it will error when the environment is switched to TicTacToe:

ERROR trial_runner.py:886 -- Trial DQN_tictactoe_9dc71_00000: Error processing event.
NoneType: None

I'm new to both PettingZoo and RLlib, so I'm happy to hear advice on how to move forward. Is there any relevant difference in the TicTacToe and Leduc Hold`em environments for training with RLlib? Any reason we shouldn't use the same script?

Code example The following script is copied from the latest release's holdem with rllib tutorial. It has minor edits (see comments with "edit") to switch the environment to tictactoe_v3 and update some import names:

"""
rllib_leduc_holdem_example.py

copied from
https://github.com/Farama-Foundation/PettingZoo/blob/1.19.1/tutorials/rllib_leduc_holdem.py
modified slightly to use tictactoe env
"""

import os
from copy import deepcopy

import ray
from gym.spaces import Box
from ray import tune
from ray.rllib.agents.dqn.dqn_torch_model import DQNTorchModel
from ray.rllib.agents.registry import get_trainer_class  # edit: match ray-rllib 1.13.0 api
from ray.rllib.env import PettingZooEnv
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MAX  # edit: match ray-rllib 1.13.0 api
from ray.tune.registry import register_env

# edit: switch to tictactoe
from pettingzoo.classic import tictactoe_v3

torch, nn = try_import_torch()

class TorchMaskedActions(DQNTorchModel):
    """PyTorch version of above ParametricActionsModel."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kw):
        DQNTorchModel.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kw
        )

        obs_len = obs_space.shape[0] - action_space.n

        orig_obs_space = Box(
            shape=(obs_len,), low=obs_space.low[:obs_len], high=obs_space.high[:obs_len]
        )
        self.action_embed_model = TorchFC(
            orig_obs_space,
            action_space,
            action_space.n,
            model_config,
            name + "_action_embed",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the predicted action embedding
        action_logits, _ = self.action_embed_model(
            {"obs": input_dict["obs"]["observation"]}
        )
        # turns probit action mask into logit action mask
        inf_mask = torch.clamp(torch.log(action_mask), -1e10, FLOAT_MAX)

        return action_logits + inf_mask, state

    def value_function(self):
        return self.action_embed_model.value_function()

if __name__ == "__main__":
    alg_name = "DQN"
    ModelCatalog.register_custom_model("pa_model", TorchMaskedActions)
    # function that outputs the environment you wish to register.

    def env_creator():
        env = tictactoe_v3.env()
        return env

    num_cpus = 1

    # edit: note: may want to change after TrainerConfigs introduced in Ray-RLlib
    # https://docs.ray.io/en/latest/rllib/rllib-training.html#common-parameters
    config = deepcopy(get_trainer_class(alg_name)._default_config)

    # edit: switch env
    register_env("tictactoe", lambda config: PettingZooEnv(env_creator()))

    test_env = PettingZooEnv(env_creator())
    obs_space = test_env.observation_space
    print(obs_space)
    act_space = test_env.action_space

    config["multiagent"] = {
        "policies": {
            "player_0": (None, obs_space, act_space, {}),
            "player_1": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: agent_id,
    }

    config["num_gpus"] = int(os.environ.get("RLLIB_NUM_GPUS", "0"))
    config["log_level"] = "DEBUG"
    config["num_workers"] = 1
    config["rollout_fragment_length"] = 30
    config["train_batch_size"] = 200
    config["horizon"] = 200
    config["no_done_at_end"] = False
    config["framework"] = "torch"
    config["model"] = {
        "custom_model": "pa_model",
    }
    config["n_step"] = 1

    config["exploration_config"] = {
        # The Exploration class to use.
        "type": "EpsilonGreedy",
        # Config for the Exploration class' constructor:
        "initial_epsilon": 0.1,
        "final_epsilon": 0.0,
        "epsilon_timesteps": 100000,  # Timesteps over which to anneal epsilon.
    }
    config["hiddens"] = []
    config["dueling"] = False
    config["env"] = "tictactoe"  # edit

    ray.init(num_cpus=num_cpus + 1)

    tune.run(
        alg_name,
        name="DQN",
        stop={"timesteps_total": 10000000},
        checkpoint_freq=10,
        config=config,
    )

Log and truncated traceback (apologies for the long section, scroll to end for error):

Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]]], [[[1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]]], (3, 3, 2), int8))
(pid=5435) 
(DQNTrainer pid=5440) 2022-07-24 13:47:03,168   INFO simple_q.py:187 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
(bundle_reservation_check_func pid=5436) 
(RolloutWorker pid=5448) 2022-07-24 13:47:06,064    WARNING env.py:42 -- Skipping env checking for this experiment
(RolloutWorker pid=5448) 2022-07-24 13:47:06,064    DEBUG rollout_worker.py:1770 -- Creating policy for player_0
(RolloutWorker pid=5448) 2022-07-24 13:47:06,065    DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,075    DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([[[0 0]
[...]
(RolloutWorker pid=5448)   [1 1]]], (3, 3, 2), int8)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,076    DEBUG catalog.py:805 -- Created preprocessor <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fb61f8ec130>: Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
[...]
(RolloutWorker pid=5448)   [1 1]]], (3, 3, 2), int8)) -> (27,)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,079    INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(RolloutWorker pid=5448) 2022-07-24 13:47:06,131    INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(RolloutWorker pid=5448) 2022-07-24 13:47:06,136    INFO torch_policy.py:190 -- TorchPolicy (worker=1) running on CPU.
(RolloutWorker pid=5448) 2022-07-24 13:47:06,152    DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,153    DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([[[0 0]
[...]
(RolloutWorker pid=5448)   [1 1]]], (3, 3, 2), int8)
== Status ==
Current time: 2022-07-24 13:47:06 (running for 00:00:07.36)
Memory usage on this node: 10.1/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/4.16 GiB heap, 0.0/2.0 GiB objects
Result logdir: /.../ray_results/DQN
Number of trials: 1/1 (1 RUNNING)
+---------------------------+----------+----------------+
| Trial name                | status   | loc            |
|---------------------------+----------+----------------|
| DQN_tictactoe_9dc71_00000 | RUNNING  | 127.0.0.1:5440 |
+---------------------------+----------+----------------+

(DQNTrainer pid=5440) 2022-07-24 13:47:06,279   INFO worker_set.py:162 -- Inferred observation/action spaces from remote worker (local worker has no env): {'player_1': (Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
[..]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)), Discrete(9)), 'player_0': (Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
[...]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)), Discrete(9)), '__env__': (Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
[...]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)), Discrete(9))}
(DQNTrainer pid=5440) 2022-07-24 13:47:06,279   DEBUG rollout_worker.py:1770 -- Creating policy for player_0
(DQNTrainer pid=5440) 2022-07-24 13:47:06,280   DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,281   DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([[[0 0]
[...]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,282   DEBUG catalog.py:805 -- Created preprocessor <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fe804d98850>: Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
[...]
(RolloutWorker pid=5448) 2022-07-24 13:47:06,251    DEBUG rollout_worker.py:1770 -- Creating policy for player_1
(RolloutWorker pid=5448) 2022-07-24 13:47:06,251    DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,252    DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([[[0 0]
[...]
(RolloutWorker pid=5448)   [1 1]]], (3, 3, 2), int8)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,252    DEBUG catalog.py:805 -- Created preprocessor <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fb61f8eca90>: Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
[...]
(RolloutWorker pid=5448)   [1 1]]], (3, 3, 2), int8)) -> (27,)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,253    INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(RolloutWorker pid=5448) 2022-07-24 13:47:06,256    INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(RolloutWorker pid=5448) 2022-07-24 13:47:06,259    INFO torch_policy.py:190 -- TorchPolicy (worker=1) running on CPU.
(RolloutWorker pid=5448) 2022-07-24 13:47:06,264    DEBUG rollout_worker.py:783 -- Created rollout worker with env <ray.rllib.env.multi_agent_env.MultiAgentEnvWrapper object at 0x7fb61efaa250> (<PettingZooEnv instance>), policies {}
(RolloutWorker pid=5448) 2022-07-24 13:47:06,333    INFO rollout_worker.py:819 -- Generating sample batch of size 30
(DQNTrainer pid=5440)   [1 1]]
(DQNTrainer pid=5440) 
(DQNTrainer pid=5440)  [[1 1]
(DQNTrainer pid=5440)   [1 1]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)) -> (27,)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,284   INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(DQNTrainer pid=5440) 2022-07-24 13:47:06,287   INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(DQNTrainer pid=5440) 2022-07-24 13:47:06,290   INFO torch_policy.py:190 -- TorchPolicy (worker=local) running on CPU.
(DQNTrainer pid=5440) 2022-07-24 13:47:06,292   DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,293   DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([[[0 0]
[...]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,298   DEBUG rollout_worker.py:1770 -- Creating policy for player_1
(DQNTrainer pid=5440) 2022-07-24 13:47:06,300   DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,300   DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([[[0 0]
[...]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,301   DEBUG catalog.py:805 -- Created preprocessor <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fe804d98df0>: Dict(action_mask:Box([0 0 0 0 0 0 0 0 0], [1 1 1 1 1 1 1 1 1], (9,), int8), observation:Box([[[0 0]
[...]
(DQNTrainer pid=5440)   [1 1]]], (3, 3, 2), int8)) -> (27,)
(DQNTrainer pid=5440) 2022-07-24 13:47:06,302   INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(DQNTrainer pid=5440) 2022-07-24 13:47:06,306   INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
(DQNTrainer pid=5440) 2022-07-24 13:47:06,308   INFO torch_policy.py:190 -- TorchPolicy (worker=local) running on CPU.
(DQNTrainer pid=5440) 2022-07-24 13:47:06,313   INFO rollout_worker.py:1793 -- Built policy map: {}
(DQNTrainer pid=5440) 2022-07-24 13:47:06,313   INFO rollout_worker.py:1794 -- Built preprocessor map: {'player_0': <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fe804d98850>, 'player_1': <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fe804d98df0>}
(DQNTrainer pid=5440) 2022-07-24 13:47:06,313   INFO rollout_worker.py:670 -- Built filter map: {'player_0': <ray.rllib.utils.filter.NoFilter object at 0x7fe804d98820>, 'player_1': <ray.rllib.utils.filter.NoFilter object at 0x7fe804e61f40>}
(DQNTrainer pid=5440) 2022-07-24 13:47:06,313   DEBUG rollout_worker.py:783 -- Created rollout worker with env None (None), policies {}
(DQNTrainer pid=5440) 2022-07-24 13:47:06,321   WARNING util.py:65 -- Install gputil for GPU system monitoring.
(RolloutWorker pid=5448) 2022-07-24 13:47:06,335    INFO sampler.py:664 -- Raw obs from env: { 0: { 'player_1': { 'action_mask': np.ndarray((9,), dtype=int8, min=1.0, max=1.0, mean=1.0),
(RolloutWorker pid=5448)                      'observation': np.ndarray((3, 3, 2), dtype=int8, min=0.0, max=0.0, mean=0.0)}}}
(RolloutWorker pid=5448) 2022-07-24 13:47:06,335    INFO sampler.py:665 -- Info return from env: {0: {}}
(RolloutWorker pid=5448) 2022-07-24 13:47:06,336    WARNING deprecation.py:46 -- DeprecationWarning: `policy_mapping_fn(agent_id)` has been deprecated. Use `policy_mapping_fn(agent_id, episode, worker, **kwargs)` instead. This will raise an error in the future!
(RolloutWorker pid=5448) 2022-07-24 13:47:06,340    INFO sampler.py:900 -- Preprocessed obs: np.ndarray((27,), dtype=float32, min=0.0, max=1.0, mean=0.333)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,340    INFO sampler.py:905 -- Filtered obs: np.ndarray((27,), dtype=float32, min=0.0, max=1.0, mean=0.333)
(RolloutWorker pid=5448) 2022-07-24 13:47:06,343    INFO sampler.py:1135 -- Inputs to compute_actions():
(RolloutWorker pid=5448) 
(RolloutWorker pid=5448) { 'player_1': [ { 'data': { 'agent_id': 'player_1',
(RolloutWorker pid=5448)                             'env_id': 0,
(RolloutWorker pid=5448)                             'info': {},
(RolloutWorker pid=5448)                             'obs': np.ndarray((27,), dtype=float32, min=0.0, max=1.0, mean=0.333),
(RolloutWorker pid=5448)                             'prev_action': None,
(RolloutWorker pid=5448)                             'prev_reward': 0.0,
(RolloutWorker pid=5448)                             'rnn_state': None},
(RolloutWorker pid=5448)                   'type': 'PolicyEvalData'}]}
(RolloutWorker pid=5448) 
2022-07-24 13:47:06,385 ERROR trial_runner.py:886 -- Trial DQN_tictactoe_9dc71_00000: Error processing event.
NoneType: None

and later:

Resources requested: 0/2 CPUs, 0/0 GPUs, 0.0/4.16 GiB heap, 0.0/2.0 GiB objects
Result logdir: /.../ray_results
Number of trials: 1/1 (1 ERROR)
+---------------------------+----------+----------------+
| Trial name                | status   | loc            |
|---------------------------+----------+----------------|
| DQN_tictactoe_9dc71_00000 | ERROR    | 127.0.0.1:5440 |
+---------------------------+----------+----------------+
Number of errored trials: 1
+---------------------------+--------------+---------------------------------------------------------------------------------------+
| Trial name                |   # failures | error file                                                                            |
|---------------------------+--------------+---------------------------------------------------------------------------------------|
| DQN_tictactoe_9dc71_00000 |            1 | /.../ray_results/DQN/DQN_tictactoe_9dc71_00000_0_2022-07-24_13-46-59/error.txt |
+---------------------------+--------------+---------------------------------------------------------------------------------------+

(DQNTrainer pid=5440) 2022-07-24 13:47:06,373   WARNING trainer.py:1124 -- Worker crashed during call to `step_attempt()`. To try to continue training without failed worker(s), set `ignore_worker_failures=True`. To try to recover the failed worker(s), set `recreate_failed_workers=True`.
(RolloutWorker pid=5448) 2022-07-24 13:47:06,357    INFO sampler.py:1161 -- Outputs of compute_actions():
(RolloutWorker pid=5448) 
(RolloutWorker pid=5448) { 'player_1': ( np.ndarray((1,), dtype=int32, min=0.0, max=0.0, mean=0.0),
(RolloutWorker pid=5448)                 [],
(RolloutWorker pid=5448)                 { 'action_dist_inputs': np.ndarray((1, 9), dtype=float32, min=0.0, max=0.0, mean=0.0),
(RolloutWorker pid=5448)                   'action_logp': np.ndarray((1,), dtype=float32, min=0.0, max=0.0, mean=0.0),
(RolloutWorker pid=5448)                   'action_prob': np.ndarray((1,), dtype=float32, min=1.0, max=1.0, mean=1.0),
(RolloutWorker pid=5448)                   'q_values': np.ndarray((1, 9), dtype=float32, min=0.0, max=0.0, mean=0.0)})}
(RolloutWorker pid=5448) 
Traceback (most recent call last):
  File "/.../rllib_leduc_holdem_example.py", line 122, in <module>
    tune.run(
  File "/.../ray/tune/tune.py", line 741, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [DQN_tictactoe_9dc71_00000])

Process finished with exit code 1

System Info

I am running an Intel mac, macOS 12.4

python env: using conda, with this env YAML:

channels:
- defaults
dependencies:
- python>=3.7,<3.10  # using 3.9.12
- pip>=3
- scikit-learn  # using 1.0.2
- seaborn
- pandas
- numpy
- scipy
- ipython
- jupyter
- nomkl  # avoid OpenMP issue (#15) on macOS/Intel hardware. used v3.0
- pip:
- torch  # have to use pip if installing nomkl. used v1.12.0
- gym  # used 0.21.0
- ray[rllib]==1.13
- tensorboard
- pettingzoo[classic]==1.19   # used 1.19 and not 1.19.1 since ray 1.13 doesn't support gym>=0.22
- supersuit
- stable-baselines3

The latest ray only supports up to gym 0.24 and pettingzoo actually wants gym 0.24.1 or later (only change between 1.19 and latest 1.19.1 is updating the gym requirement). I'm not sure if it should matter in this case, since the script runs fine for me on leduc_holdem.

Additional context Note that TicTacToe and Holdem are not positive sum games (which is a soft "requirement" of RLlib's PettingZooEnv wrapper). But this shouldn't matter for just getting something to run.

Checklist

[v] I have checked that there is no similar issue in the repo

There are several (+1 year old) discussions on RLlib/pettingzoo interfacing. Most recently there is a motion to adopt pettingzoo's api into Ray-RLlib's multi agent setting (see https://github.com/ray-project/ray/issues/23975#issue-1207213022), which I'm for. Obviously this issue would be impacted.

Farama-Foundation / PettingZoo

[Bug Report] [rllib] RLlib tutorial works with leduc_holdem but not with tictactoe? #742

Checklist