TonghanWang / RODE

Codes accompanying the paper "RODE: Learning Roles to Decompose Multi-Agent Tasks" (ICLR 2021, https://arxiv.org/abs/2010.01523). RODE is a scalable role-based multi-agent learning method that discovers roles by decomposing the joint action space according to action effects, establishing a new state of the art on the StarCraft multi-agent benchmark.
Apache License 2.0

Why can it choose new roles during training? #12

Open 98luobo opened 1 year ago

98luobo commented 1 year ago

Hi, I have a question I've been wondering about. Is it correct that in the rode_controller's `forward()` function, the role selector you defined chooses new roles for the agents in both decentralized execution and centralized training?

```python
def forward(self, ep_batch, t, test_mode=False, t_env=None):
    agent_inputs = self._build_inputs(ep_batch, t)

    # select roles
    self.role_hidden_states = self.role_agent(agent_inputs, self.role_hidden_states)
    role_outputs = None
    if t % self.role_interval == 0:
        role_outputs = self.role_selector(self.role_hidden_states, self.role_latent)
        self.selected_roles = self.role_selector.select_role(role_outputs, test_mode=test_mode, t_env=t_env).squeeze()
```

I don't understand why it is necessary to reselect the agents' roles when the data are sampled from the ReplayBuffer during training. I look forward to your answer.

98luobo commented 1 year ago

Hi, are there any hidden details that weren't mentioned?

TonghanWang commented 1 year ago

Thanks for your question. I'm not entirely clear about what you're asking, so maybe I've misunderstood your meaning. For now, my thought is that roles should be selected during both the training and test phases. How could we train the role selector if the training and test settings were different?

The role selector is based on local information and can run in both centralized training and decentralized execution phases.

There is no hidden part.


TonghanWang commented 1 year ago

Roles are selected every several time steps (every 5 in the experiments). This periodic selection lets the agents adjust to the dynamic requirements of the task. The policy is trained under this scheme, and naturally it should be tested under the same scheme.
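A minimal sketch of this scheme (assumed class and parameter names, not the actual RODE implementation): roles are refreshed only when `t % role_interval == 0` and reused in between, and the same schedule applies whether the controller is stepping through the environment or re-running the forward pass over an episode sampled from the replay buffer.

```python
import torch
import torch.nn as nn

class PeriodicRoleSelector(nn.Module):
    """Illustrative only: refresh roles every `role_interval` steps and keep
    them fixed in between."""

    def __init__(self, hidden_dim, n_roles, role_interval=5):
        super().__init__()
        self.role_interval = role_interval
        self.role_q = nn.Linear(hidden_dim, n_roles)  # role utilities per agent
        self.current_roles = None

    def forward(self, hidden_states, t, test_mode=False, epsilon=0.05):
        # hidden_states: [n_agents, hidden_dim]
        if t % self.role_interval == 0 or self.current_roles is None:
            role_qs = self.role_q(hidden_states)        # [n_agents, n_roles]
            greedy_roles = role_qs.argmax(dim=-1)
            if test_mode:
                self.current_roles = greedy_roles       # greedy at test time
            else:
                # epsilon-greedy exploration during training rollouts
                random_roles = torch.randint(role_qs.shape[-1], (role_qs.shape[0],))
                explore = torch.rand(role_qs.shape[0]) < epsilon
                self.current_roles = torch.where(explore, random_roles, greedy_roles)
        return self.current_roles
```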

98luobo commented 1 year ago

Hi, thank you very much for your reply, but I still have doubts about this mechanism. Let's go back to the implementation and training of QMIX. Decentralized execution is fine, but centralized training uses data from the ReplayBuffer, and the original QMIX uses the stored <O, A, S, Reward> instead of selecting a new A for each agent. This actually corresponds to the Role Selector you proposed: selecting a role is equivalent to an agent called a "role selector" taking an action. Shouldn't its training and execution also strictly follow the flow of off-policy algorithms such as DQN?

I tried fixing the selected roles during training. Specifically, the role information is taken entirely from the ReplayBuffer, e.g. <S, O, A, Reward, Role>. But to be honest, the results are very poor.
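For concreteness, a rough sketch of the two variants being compared here (the `controller`/`batch` interfaces are hypothetical, not RODE's actual learner code): re-selecting roles with the current role selector during the training forward pass, versus freezing roles to whatever was stored in the ReplayBuffer.

```python
def forward_reselect_roles(batch, controller):
    """RODE-style: the current role selector re-picks roles on replayed data,
    so its Q-values appear in the computation graph and can be trained with a
    TD objective, just like the per-agent action Q-values."""
    action_qs = []
    for t in range(batch.max_seq_length):
        roles = controller.select_roles(batch, t)      # recomputed every training pass
        action_qs.append(controller.action_q(batch, t, roles))
    return action_qs


def forward_frozen_roles(batch, controller):
    """The variant described above: roles are read back from the buffer,
    so the role selector never acts on fresh data during training."""
    action_qs = []
    for t in range(batch.max_seq_length):
        roles = batch["roles"][:, t]                   # recorded at collection time
        action_qs.append(controller.action_q(batch, t, roles))
    return action_qs
```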

TonghanWang commented 1 year ago

In my opinion, the role selection is not an action. It is more like a latent variable conditioned on what we have in the replay buffer. We do not use more information than what is stored in the buffer.

As for your idea, I think it amounts to using a supervised learning objective to train the role selector rather than a reinforcement learning objective. Maybe you can try mixing the two losses and see what happens.
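A minimal sketch of that suggestion (assumed tensor shapes and function names, not RODE's actual learner): combine a TD-style loss on the role Q-values with a supervised cross-entropy term toward the roles stored in the buffer.

```python
import torch
import torch.nn.functional as F

def mixed_role_loss(role_qs, stored_roles, td_targets, sup_weight=0.5):
    # role_qs:      [batch, n_agents, n_roles]  Q-values from the role selector
    # stored_roles: [batch, n_agents] (long)    roles recorded at collection time
    # td_targets:   [batch, n_agents]           bootstrapped targets for the chosen roles

    # RL objective: regress the Q-value of the stored role toward its TD target
    chosen_q = role_qs.gather(-1, stored_roles.unsqueeze(-1)).squeeze(-1)
    td_loss = F.mse_loss(chosen_q, td_targets)

    # Supervised objective: imitate the roles stored in the replay buffer
    sup_loss = F.cross_entropy(
        role_qs.reshape(-1, role_qs.shape[-1]), stored_roles.reshape(-1)
    )
    return td_loss + sup_weight * sup_loss
```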
