mamadpierre opened this issue 3 years ago
Hi, thanks for your questions. RODE is a CTDE method.
- In `RODEMAC.select_actions`, `role_outputs` is returned but not used. This is the difference compared with the learner.
- As for `n_roles`: you are right, we do not need to know it. We wrote the code this way only because the torch function `gather` lets us implement the algorithm conveniently.
- About `role_latent`: the first dimension is for each agent (possibly from different batches), and the last two dimensions are the role representations, which are the same for all agents. Agents do not need to know the action-observation histories of others, because in the `bmm`, the last two dimensions of `inputs` in the `forward` function come from local history (see the toy sketch below).
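For concreteness, here is a minimal toy sketch of that `bmm` (illustrative shapes and names, not the exact RODE code): the batch dimension indexes agents and the role table is shared, so no agent's role scores depend on another agent's hidden state.

```python
import torch as th

bs, n_agents, n_roles, latent_dim = 2, 5, 3, 20

# Illustrative stand-ins (not the actual RODE tensors): each agent's own hidden
# state projected into the role-latent space, and one role table shared by all agents.
h_local = th.randn(bs * n_agents, 1, latent_dim)                    # built from local history only
role_latent = th.randn(n_roles, latent_dim)                         # identical for every agent
role_latent_batched = role_latent.unsqueeze(0).expand(bs * n_agents, -1, -1)

# bmm pairs each agent's local embedding with the shared role representations;
# the batch dimension indexes agents, so no agent's role scores depend on
# another agent's hidden state.
role_scores = th.bmm(h_local, role_latent_batched.transpose(1, 2))  # [bs * n_agents, 1, n_roles]
print(role_scores.shape)  # torch.Size([10, 1, 3])
```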
Thanks for your time and reply. I am not convinced, though.

My concern is that in `RODEMAC.select_actions`, `agent_outputs` is obviously returned and used. And to build `agent_outputs` you use information that should be kept separate between execution time and learning time, but you do not separate them. At execution time we are not allowed to introduce extra information or structure into the problem for agents to use, yet at execution time you still run the chunk of code below, located in `RODEMAC.forward`:
```python
self.hidden_states = self.agent(agent_inputs, self.hidden_states)

roles_q = []
for role_i in range(self.n_roles):
    # Q-values of role_i for every agent: [bs * n_agents, n_actions]
    role_q = self.roles[role_i](self.hidden_states, self.action_repr)
    roles_q.append(role_q)
roles_q = th.stack(roles_q, dim=1)  # [bs * n_agents, n_roles, n_actions]

# Keep only the Q-vector of each agent's selected role.
agent_outs = th.gather(
    roles_q, 1,
    self.selected_roles.unsqueeze(-1).unsqueeze(-1).repeat(1, 1, self.n_actions))  # (*****)
```
Let me explain my understanding. `agent_outs` later corresponds to `chosen_actions`, so for simplicity, when I talk about agent outputs here I already mean their actions. Also for simplicity, imagine that the roles the algorithm learns during training are not complicated combinations of possible actions but, for whatever reason, simple single actions. For instance, the three roles the algorithm learns are the three simple actions of going up, going down, and going left. What you do is make sure each agent follows these three roles even at execution time. In other words, you do not allow agents to do anything else, such as choosing the action of going right, when going right was learned to be a bad choice (in terms of roles) during training.

I guess this is not allowed. Maybe an agent wants to ruin everything and just act randomly, going right many times during execution. In other words, are you allowed to impose a structural limitation on agents' behavior at execution time (such as obeying a limited number of roles learned during training), based on the learning procedure?
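To make this concrete, here is a toy version of the gather in (*****) (illustrative shapes, not the actual RODE tensors): once a role is selected, each agent's output Q-vector is exactly that role's Q-vector, so the restriction I am describing is baked into action selection.

```python
import torch as th

n_agents, n_roles, n_actions = 5, 3, 6

# Toy version of (*****): every agent has one Q-vector per role, and the
# selected role index picks exactly one of those vectors per agent.
roles_q = th.randn(n_agents, n_roles, n_actions)
selected_roles = th.tensor([0, 2, 1, 0, 2])           # one learned role per agent

index = selected_roles.unsqueeze(-1).unsqueeze(-1).repeat(1, 1, n_actions)
agent_outs = th.gather(roles_q, 1, index).squeeze(1)  # [n_agents, n_actions]

# agent_outs[i] equals roles_q[i, selected_roles[i]]: at execution time the agent
# can only act through the Q-values of the role chosen for it.
assert th.equal(agent_outs[1], roles_q[1, 2])
```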
BTW, for your first bullet point: `role_outputs` is being used elsewhere in `RODEMAC.forward`. It directly provides

```python
self.selected_roles = self.role_selector.select_role(role_outputs, test_mode=test_mode, t_env=t_env).squeeze()
```

and `self.selected_roles` is used inside the line marked (*****).
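Putting the two pieces together, the dependency chain is roughly as below (the greedy stand-in for `select_role` is my assumption, not the real selector):

```python
import torch as th

n_agents, n_roles = 5, 3

# role_outputs comes back from the role selector's forward pass.
role_outputs = th.randn(n_agents, n_roles)

# Greedy stand-in for select_role (my assumption; the real selector may also
# handle exploration via test_mode / t_env): one role index per agent.
selected_roles = role_outputs.argmax(dim=-1)  # shape [n_agents]

# These indices are exactly what the gather marked (*****) consumes.
print(selected_roles)
```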
As the definition goes, CTDE means: "The learning algorithm has access to all local action-observation histories and the global state s, but each agent's learnt policy can condition only on its own action-observation history." In this work, during execution, `RODEMAC.select_actions` is called and it contains the `RODEMAC.forward` method without any change compared to when `RODEMAC.forward` is called during training via the `learner` module. This makes me doubt categorizing RODE as CTDE.
For instance, at execution time, why should agents know that there are `n_roles` (for instance 3) roles out there? This is extra information. During training you can train agents to follow one of the roles, but not during execution. Taking the `5m_vs_6m` SMAC map as an example, why should agents be hard-coded into three roles during their execution?
The same story holds for `role_latent`. During training you can train a latent space representing roles that you define on top of the problem, but during execution you still select roles based on this latent space, and agents must follow one of the roles, dependent on extra information. Let me be more exact about the dimensions. During training you train a space called `role_latent_reshaped` in the class `dot_selector` (assuming you are using the `dot_selector`). This space has the shape `(bs * n_agents, n_roles, action_latent_dim)`. For the `5m_vs_6m` SMAC map, during execution (considering bs = 32), it has the shape `(160 = 32 * 5, 3, 20)`.
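To make the numbers concrete, here is the shape bookkeeping I mean (the repeat below is my own reconstruction, not the exact `dot_selector` code):

```python
import torch as th

bs, n_agents, n_roles, action_latent_dim = 32, 5, 3, 20

# One learned role table, repeated for every (batch, agent) entry; this is how a
# (160, 3, 20) tensor shows up at execution time for 5m_vs_6m.
role_latent = th.randn(n_roles, action_latent_dim)
role_latent_reshaped = role_latent.unsqueeze(0).repeat(bs * n_agents, 1, 1)
print(role_latent_reshaped.shape)  # torch.Size([160, 3, 20])
```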
In other words, you use a latent space that connects 5 agents to 3 roles even at execution time. This is somewhat close to the mixing networks that should be turned off during execution. That `(5 ---> 3)` means the 5 agents interact with each other to build the 3 roles, and the learned latent space does not fit the part of the CTDE definition which says each "agent's learnt policy can condition only on its own action-observation history", not on information about each other.

To compare with another familiar algorithm, please look at the ROMA algorithm implementation. In ROMA, the extra contributions are mainly in the agent module. During execution, when we use `ROMA.select_actions`, the `test_mode` flag turns off all the extra work of the agents and their interactions. But in the RODE implementation, `RODEMAC.forward` does not use such a variable where it is needed.
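To illustrate the kind of switch I mean, here is a toy controller (not ROMA's or RODE's actual code) where such a flag gates the training-only machinery:

```python
import torch as th
import torch.nn as nn

class ToyMAC(nn.Module):
    """Toy controller (not ROMA's or RODE's actual code) showing how a test_mode
    flag can gate training-only machinery so execution uses only the local head."""

    def __init__(self, obs_dim=8, n_actions=6, n_roles=3):
        super().__init__()
        self.policy = nn.Linear(obs_dim, n_actions)   # local policy head, always used
        self.role_head = nn.Linear(obs_dim, n_roles)  # extra structure

    def forward(self, obs, test_mode=False):
        q = self.policy(obs)
        # Extra role machinery is skipped when acting at execution time.
        role_logits = None if test_mode else self.role_head(obs)
        return q, role_logits

mac = ToyMAC()
obs = th.randn(5, 8)                       # five agents' local observations
q_exec, extra = mac(obs, test_mode=True)   # execution: extra head not computed
assert extra is None
```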
Thanks for your time and explanation.