98luobo opened this issue 1 year ago
Hi, bro. Are there any hidden parts that can't be shared?
Thanks for your question, but I'm not very clear about what you are asking; maybe I misunderstood your meaning. For now, my thought is that roles should be selected during both the training and test phases. How could we train the role selector if the training and test settings were different?
The role selector is based on local information and can run in both centralized training and decentralized execution phases.
There is no hidden part.
Roles are selected every several time steps (5 in the experiments). This periodic selection helps the agents adjust to the dynamic requirements of the task. The policy is trained under this scheme and naturally should be tested under the same scheme.
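To make that scheme concrete, here is a minimal sketch (not the RODE source; the class and attribute names are assumptions) of re-selecting a role every role_interval steps and keeping it fixed in between, which applies identically during rollouts and at test time:

```python
import torch.nn as nn

class PeriodicRoleSelector(nn.Module):
    """Minimal sketch: re-select roles every `role_interval` steps, keep them otherwise."""

    def __init__(self, obs_dim, n_roles, role_interval=5):
        super().__init__()
        self.role_interval = role_interval          # 5 in the experiments
        self.q_role = nn.Linear(obs_dim, n_roles)   # hypothetical role-value head

    def forward(self, obs, t, current_roles=None):
        # Between re-selection steps, the previously chosen roles are kept.
        if t % self.role_interval != 0 and current_roles is not None:
            return current_roles
        role_q = self.q_role(obs)                   # (n_agents, n_roles)
        return role_q.argmax(dim=-1)                # greedy roles at re-selection steps
```

During training an exploration scheme such as epsilon-greedy would sit on top of the greedy selection shown here.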
Hi, bro. Thank you very much for your reply, but I still have doubts about this mechanism. Let's return to the implementation and training of QMIX. Decentralized execution is fine, but centralized training requires data from the ReplayBuffer. The original QMIX uses the stored <O, A, S, Reward> instead of selecting a new A for each agent. This actually corresponds to the role selector you proposed: selecting a role is equivalent to an agent called a "role selector" taking an action. Its training and execution should also strictly follow the flow of off-policy algorithms such as DQN, right?
I tried fixing the selected role during training. Specifically, the role information is taken entirely from the ReplayBuffer, e.g. <S, O, A, Reward, Role>. But to tell the truth, the results are very poor.
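For clarity, the alternative being described seems to be something like the following sketch (hypothetical names, not this repository's buffer), where the role chosen at collection time is stored and replayed during training, exactly like an action in DQN/QMIX:

```python
import random
from collections import namedtuple

# Hypothetical buffer for the scheme described above: the role selected at
# data-collection time is stored and later reused, instead of being re-selected.
Transition = namedtuple("Transition", ["state", "obs", "actions", "reward", "roles"])

class ReplayBufferWithRoles:
    def __init__(self, capacity=5000):
        self.capacity = capacity
        self.storage = []

    def push(self, state, obs, actions, reward, roles):
        self.storage.append(Transition(state, obs, actions, reward, roles))
        if len(self.storage) > self.capacity:
            self.storage.pop(0)

    def sample(self, batch_size):
        # Training would then condition agent Q-values on the stored `roles`,
        # the same way QMIX conditions on the stored actions.
        return random.sample(self.storage, batch_size)
```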
In my opinion, the role selection is not an action. It is like a latent variable conditioned on what we have in the replay buffer. We did not use more information than what is stored in the buffer.
As for your idea, I think it is like using a supervised learning objective to train the role selector rather than a reinforcement learning objective. Maybe you can try mixing these two losses and see what happens.
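A rough sketch of that mixed objective (assumed shapes and names, not code from this repository): a TD-style loss on the role values plus a supervised term that matches the roles stored in the buffer:

```python
import torch.nn.functional as F

def mixed_role_selector_loss(role_q, stored_roles, td_target, beta=0.5):
    # role_q:       (batch, n_agents, n_roles) role values from the selector
    # stored_roles: (batch, n_agents) role indices taken from the replay buffer
    # td_target:    (batch, n_agents) bootstrapped targets for the chosen roles
    chosen_q = role_q.gather(-1, stored_roles.unsqueeze(-1)).squeeze(-1)
    rl_loss = F.mse_loss(chosen_q, td_target)                      # RL objective
    sl_loss = F.cross_entropy(role_q.reshape(-1, role_q.size(-1)), # supervised objective
                              stored_roles.reshape(-1))
    return rl_loss + beta * sl_loss                                # beta weights the mix
```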
Hi, bro. I have a question that I have always wondered about. Is it correct that in rode_controller's forward() function, the role selector you defined chooses new roles for the agents regardless of whether it is decentralized execution or centralized training?

"""
def forward(self, ep_batch, t, test_mode=False, t_env=None):
    agent_inputs = self._build_inputs(ep_batch, t)
    ...
"""

I don't understand why it is necessary to reselect the agents' roles when these data are sampled from the ReplayBuffer during training. I look forward to your answer.