TonghanWang / RODE

Code accompanying the paper "RODE: Learning Roles to Decompose Multi-Agent Tasks" (ICLR 2021, https://arxiv.org/abs/2010.01523). RODE is a scalable role-based multi-agent learning method that effectively discovers roles by decomposing the joint action space according to action effects, establishing a new state of the art on the StarCraft multi-agent benchmark.
Apache License 2.0

Is this work categorized as CTDE? #4

Open mamadpierre opened 3 years ago

mamadpierre commented 3 years ago

As the definition goes, CTDE means, "The learning algorithm has access to all local action-observation histories and global state s, but each agent’s learnt policy can condition only on its own action-observation history."

In this work, during execution, RODEMAC.select_actions is called, and it invokes the RODEMAC.forward method without any changes compared to when RODEMAC.forward is called during training via the learner module. This makes me doubt that RODE should be categorized as CTDE.

For instance, at execution time, why should agents know that there are n_roles (say, 3) roles out there? This is extra information. During training you can train agents to follow one of the roles, but not during execution. Take the 5m_vs_6m SMAC map: why should agents be hard-coded into three roles during execution?

The same story holds for role_latent. During training you can learn a latent space representing the roles you have defined on top of the problem. But during execution you still select roles based on this latent space, and agents must follow one of the roles, which depends on this extra information. Let me be more exact about the dimensions. During training you learn a tensor called role_latent_reshaped in the dot_selector class (assuming the dot_selector is used). It has shape (bs * n_agents, n_roles, action_latent_dim). For the 5m_vs_6m SMAC map, during execution (with bs=32) it has shape (160 = 32 * 5, 3, 20). In other words, you use a latent space that connects 5 agents to 3 roles even at execution time. This is somewhat close to the mixing networks, which are supposed to be turned off during execution. That (5 -> 3) mapping means the 5 agents interact with each other to build the 3 roles, and so the learned latent space does not fit the part of the CTDE definition that says "each agent's learnt policy can condition only on its own action-observation history", not on information from the other agents.
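To make the shapes concrete, here is a minimal standalone sketch of the dot-product role scoring I am describing (the tensor names and random contents are placeholders of mine, not the repo's exact code):

import torch as th

bs, n_agents, n_roles, action_latent_dim = 32, 5, 3, 20  # 5m_vs_6m numbers from above

# placeholder stand-ins for the learned tensors discussed above
role_latent_reshaped = th.randn(bs * n_agents, n_roles, action_latent_dim)  # (160, 3, 20)
agent_latent = th.randn(bs * n_agents, 1, action_latent_dim)                # one embedding per agent copy

role_scores = th.bmm(agent_latent, role_latent_reshaped.transpose(1, 2))    # (160, 1, 3)
selected_roles = role_scores.squeeze(1).argmax(dim=-1)                      # (160,) one role index per agent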

To compare with another familiar algorithm, please look at the ROMA implementation. In ROMA, the extra contributions live mainly in the agent module. During execution, when we use ROMA.select_actions, the variable test_mode=False turns off all the extra work of the agents and their interactions. But in the RODE implementation, RODEMAC.forward does not use such a variable where it is needed.
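The pattern I mean is roughly the following (a schematic sketch of the flag-gating idea only, with made-up module names, not ROMA's actual code):

import torch.nn as nn

class SketchMAC(nn.Module):
    # Schematic only: training-time extras are gated behind a flag,
    # so execution uses nothing beyond the local policy.
    def __init__(self, obs_dim=10, n_actions=12):
        super().__init__()
        self.agent = nn.Linear(obs_dim, n_actions)                  # always-on local policy
        self.extra_training_head = nn.Linear(n_actions, n_actions)  # training-only machinery

    def forward(self, obs, use_extras=True):
        q = self.agent(obs)
        if use_extras:
            q = self.extra_training_head(q)  # disabled at execution time
        return q

At execution time one would call something like SketchMAC()(obs, use_extras=False).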

Thanks for your time and explanation

TonghanWang commented 3 years ago

Hi, thanks for your questions. RODE is a CTDE method.

  1. In RODEMAC.select_actions, role_outputs is returned but not used. This is the difference compared to the learner.

  2. As for n_roles: you are right, agents do not need to know it. We wrote the code this way just because it lets us conveniently implement the algorithm with the torch gather function.

  3. About role_latent: the first dimension indexes the agents (possibly from different batches), and the last two dimensions hold the role representations, which are the same for all agents. Agents do not need to know the action-observation histories of others, because in the bmm the last two dimensions of the inputs to the forward function come only from local history (see the sketch below).
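A minimal sketch of that point with assumed shapes (placeholder tensors, not the exact repo code): the role representations are one shared matrix replicated across the agent dimension, and the bmm mixes nothing across agents.

import torch as th

shared_role_latent = th.randn(3, 20)                               # one set of role representations
role_latent = shared_role_latent.unsqueeze(0).expand(160, -1, -1)  # the same matrix for every agent copy
local_hidden = th.randn(160, 1, 20)                                # each agent's own history embedding

# row i of the result depends only on local_hidden[i]; there are no cross-agent terms
scores = th.bmm(local_hidden, role_latent.transpose(1, 2))         # (160, 1, 3)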

mamadpierre commented 3 years ago

Thanks for your time and reply. I am not convinced, though.

My concern is that in RODEMAC.select_actions, agent_outputs is obviously returned and used. And to build agent_outputs you use information that should be kept separate between execution time and learning time, but you do not separate them.

At execution time we are not allowed to introduce extra information or structure into the problem for the agents to use. But at execution time you still run the chunk of code below, located in RODEMAC.forward:

self.hidden_states = self.agent(agent_inputs, self.hidden_states)
roles_q = []
for role_i in range(self.n_roles):
    role_q = self.roles[role_i](self.hidden_states, self.action_repr)  # [bs * n_agents, n_actions]
    roles_q.append(role_q)

roles_q = th.stack(roles_q, dim=1)  # [bs*n_agents, n_roles, n_actions]
agent_outs = th.gather(roles_q, 1,
                       self.selected_roles.unsqueeze(-1).unsqueeze(-1).repeat(1, 1, self.n_actions))  # (*****)
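For concreteness, here is a minimal standalone example of what the gather on the line marked (*****) does, using the 5m_vs_6m shapes from above (n_actions is just a placeholder value of mine):

import torch as th

bs, n_agents, n_roles, n_actions = 32, 5, 3, 12
roles_q = th.randn(bs * n_agents, n_roles, n_actions)      # per-role Q-values, (160, 3, 12)
selected_roles = th.randint(0, n_roles, (bs * n_agents,))  # one selected role index per agent copy

index = selected_roles.unsqueeze(-1).unsqueeze(-1).repeat(1, 1, n_actions)  # (160, 1, 12)
agent_outs = th.gather(roles_q, 1, index)                  # (160, 1, 12): only the selected role's Q-values survive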

Let me explain my understanding. agent_outs later corresponds to chosen_actions, so for simplicity, when I talk about agent outputs here, I already mean their actions. Also for simplicity, imagine that the roles the algorithm learns during training are not complicated combinations of possible actions but, for whatever reason, simple single actions. For instance, the 3 roles the algorithm learns are the three simple actions of going up, going down, and going left. What you do is make sure each agent follows these three roles even at execution time. In other words, you do not allow agents to do anything else, such as choosing the action of going right, when going right was learned during training to be a bad choice (in terms of roles).

I guess this is not allowed. Maybe an agent wants to ruin everything, act randomly, and go right many times during execution.

In other words, are you allowed to impose a structural limitation on the agents' behavior at execution time (such as obeying a limited number of roles learned during training), based on the learning procedure?


BTW, regarding your first bullet point, role_outputs is used elsewhere in RODEMAC.forward. It directly provides

self.selected_roles = self.role_selector.select_role(role_outputs, test_mode=test_mode, t_env=t_env).squeeze()

and self.selected_roles is used in the line marked (*****).