Stanford-ILIAD / PantheonRL

PantheonRL is a package for training and testing multi-agent reinforcement learning environments. PantheonRL supports cross-play, fine-tuning, ad-hoc coordination, and more.
MIT License

Overcooked and OffPolicyAgent #12

Open ConstantinRuhdorfer opened 1 year ago

ConstantinRuhdorfer commented 1 year ago

Hi,

I adapted the simple example to use

import gym
from overcookedgym.overcooked_utils import LAYOUT_LIST
from pantheonrl.common.agents import OnPolicyAgent, OffPolicyAgent
from stable_baselines3 import PPO, DQN

layout = "simple"
assert layout in LAYOUT_LIST
print(f"Using layout: {layout} from {LAYOUT_LIST}")

env = gym.make("OvercookedMultiEnv-v0", layout_name=layout)

partner = OffPolicyAgent(DQN("MlpPolicy", env, verbose=1))
env.add_partner_agent(partner)

ego = DQN("MlpPolicy", env, verbose=1)
ego.learn(total_timesteps=1000)

just to test OffPolicyAgent, but I keep getting:

Traceback (most recent call last):
  File "/projects/ruhdorfer/msc2023_constantin/src/scripts/train_simple_overcooked.py", line 31, in <module>
    ego.learn(total_timesteps=1000)
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/stable_baselines3/dqn/dqn.py", line 269, in learn
    return super().learn(
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 311, in learn
    rollout = self.collect_rollouts(
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 543, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(actions)
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 163, in step
    return self.step_wait()
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 54, in step_wait
    obs, self.buf_rews[env_idx], self.buf_dones[env_idx], self.buf_infos[env_idx] = self.envs[env_idx].step(
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/stable_baselines3/common/monitor.py", line 95, in step
    observation, reward, done, info = self.env.step(action)
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/gym/wrappers/order_enforcing.py", line 11, in step
    observation, reward, done, info = self.env.step(action)
  File "/projects/ruhdorfer/PantheonRL/pantheonrl/common/multiagentenv.py", line 195, in step
    acts = self._get_actions(self._players, self._obs, action)
  File "/projects/ruhdorfer/PantheonRL/pantheonrl/common/multiagentenv.py", line 157, in _get_actions
    actions.append(agent.get_action(ob))
  File "/projects/ruhdorfer/PantheonRL/pantheonrl/common/agents.py", line 263, in get_action
    self.model._store_transition(
  File "/projects/ruhdorfer/msc2023_constantin/venv/lib/python3.10/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 455, in _store_transition
    for i, done in enumerate(dones):
TypeError: 'bool' object is not iterable

This seems to be because SB3 expects a VecEnv-style batch of dones from env.step in stable_baselines3/common/off_policy_algorithm.py:544 (new_obs, rewards, dones, infos = env.step(actions)), whereas Overcooked returns a single bool done in overcookedgym/overcooked.py:80.
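The failure mode can be reproduced in isolation (a minimal sketch, independent of PantheonRL and SB3: store_transition here just mimics the loop in SB3's _store_transition, which assumes batched VecEnv output):

```python
def store_transition(dones):
    """Mimics SB3's `for i, done in enumerate(dones)` loop."""
    ends = []
    for i, done in enumerate(dones):  # raises TypeError if dones is a bare bool
        ends.append((i, done))
    return ends

try:
    store_transition(True)  # what the Overcooked wrapper currently hands over
except TypeError as e:
    print(e)  # 'bool' object is not iterable

print(store_transition([True]))  # [(0, True)] -- a one-element list iterates fine
```

This is why wrapping the bool in a list is enough to satisfy the off-policy code path.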

Are off-policy algorithms not supported? Is there a good way of fixing this, e.g. by changing line 80 from

return (ego_obs, alt_obs), (reward, reward), done, {}#info

to

return (ego_obs, alt_obs), (reward, reward), [done], {}#info

?

Thank you!

Cheers, Constantin

ConstantinRuhdorfer commented 1 year ago

Hi, I can confirm that simply changing line 80 in the multi_step method of overcookedgym/overcooked.py from:

return (ego_obs, alt_obs), (reward, reward), done, {}#info

to this

return (ego_obs, alt_obs), (reward, reward), [done], {}#info

fixes the issue and still works with OnPolicyAgent and PPO. I will open a PR; could you comment on whether this change has any other implications? Thanks!
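For anyone who cannot patch the installed package, a thin wrapper achieves the same normalization without editing overcooked.py (a sketch; ListDoneWrapper is illustrative and not part of PantheonRL):

```python
class ListDoneWrapper:
    """Delegates to a wrapped env, normalizing a bare bool `done` to a list."""

    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # Forward everything else (reset, add_partner_agent, ...) unchanged.
        return getattr(self.env, name)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if isinstance(done, bool):  # wrap only bare bools, so this stays safe
            done = [done]           # if the upstream fix lands later
        return obs, reward, done, info
```

Used as env = ListDoneWrapper(gym.make("OvercookedMultiEnv-v0", layout_name=layout)), the rest of the snippet above stays unchanged.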

ConstantinRuhdorfer commented 1 year ago

PR is here #14