Hello, Ray RLlib is designed around single-policy ("singleton") updating. Therefore, in order to implement heterogeneous updating, we cannot use the loss function directly.
Here is the implementation:
First, in https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/core/CC/happo.py#L74 there is a loop:
for i, iter_train_info in enumerate(get_each_agent_train(model, policy, dist_class, train_batch)):
    iter_model, iter_dist_class, iter_train_batch, iter_mask, \
        iter_reduce_mean, iter_actions, iter_policy, iter_prev_action_logp = iter_train_info
In this loop, we get each agent's model, batch, and so on. These data were collected in postprocessing.
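To make the shape of that loop concrete, here is a minimal, hypothetical sketch of what a get_each_agent_train-style generator could yield. The key names, the "self" convention, and the assumption that other agents' batches live inside the main agent's train batch are illustrative only, not MARLlib's actual code:

```python
# Illustrative sketch only (not MARLlib's real implementation): yield one
# agent's training ingredients per iteration, assuming postprocessing stored
# each other agent's batch under hypothetical per-agent keys.
def get_each_agent_train_sketch(policies, dist_class, train_batch):
    for agent_id, agent_policy in policies.items():
        if agent_id == "self":
            iter_batch = train_batch                      # the main agent's own data
        else:
            iter_batch = train_batch[f"{agent_id}_batch"]  # hypothetical key
        yield agent_policy.model, dist_class, iter_batch, agent_policy
```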
Secondly, we train each agent's model in https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/core/CC/happo.py#L114
Thirdly, we use each agent's freshly updated model to recompute the sampling importance ratio, which gives the $M$-advantage.
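As a rough sketch of those two steps: a standard PPO-style clipped surrogate is computed per agent, and after an agent is updated the running advantage is scaled by that agent's importance ratio, following the HAPPO idea. The function names and the clip parameter below are illustrative, not the repository's exact code:

```python
import torch

# Illustrative sketch, not MARLlib's exact implementation.
def clipped_surrogate_loss(curr_logp, old_logp, advantages, clip_param=0.3):
    # Standard PPO clipped surrogate objective for one agent.
    ratio = torch.exp(curr_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))

def update_m_advantage(m_advantage, new_logp, old_logp):
    # After an agent has been updated, scale the running M-advantage by its
    # importance ratio pi_new(a|s) / pi_old(a|s), so the next agent in the
    # sequence trains against its already-updated predecessors.
    ratio = torch.exp(new_logp - old_logp)
    return ratio.detach() * m_advantage
```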
Fourthly, in order to get each agent's model results, there is an important class that needs to be mentioned: https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/utils/heterogeneous_updateing.py#L54
Because of this class, we can access each agent's data by re-implementing the built-in __getitem__ function.
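For intuition, here is a minimal, hypothetical sketch of such a wrapper. The class name, the prefix convention, and the fallback behaviour are assumptions for illustration, not the class at the link above:

```python
# Hypothetical sketch: a thin view over the shared train batch that overrides
# __getitem__ so one agent's slice of the data can be addressed by plain keys.
class PerAgentBatchView:
    def __init__(self, train_batch, agent_prefix):
        self._batch = train_batch
        self._prefix = agent_prefix  # e.g. "agent_1_" (illustrative convention)

    def __getitem__(self, key):
        # Prefer this agent's prefixed key; fall back to the shared key.
        prefixed = self._prefix + key
        if prefixed in self._batch:
            return self._batch[prefixed]
        return self._batch[key]
```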
By combining the above steps, we implement the updates of the related agents sequentially and dependently.
Am I clear on this?
Thanks for the response. Basically, I understand the logic. What confused me is the following: I ran the command
python main.py --algo_config=happo --env_config=lbf
Then I printed all_polices_with_names in https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/utils/heterogeneous_updateing.py#L54:~:text=get_each_agent_train, and it turns out that only the self policy is in all_polices_with_names. Then what is the meaning of iterating over all_polices_with_names?
Hello, this is actually a feature of Ray itself. It is pretty tricky.
Here is the thing:
Suppose there are N agents in a task.
During the first sampling steps, RLlib will run ONLY ONE agent.
But after some steps, as a kind of "warm-up", all agents will appear.
For example, if we check the postprocessing function here:
We will find that, in the beginning steps, the crucial parameter other_agent_batches
is None, which means RLlib only runs a single agent.
In order to solve this problem, I implemented the connect-and-share
function here:
This function checks whether 'other_agent_batches' is present in the sampled batch. If it is, the function first shares the critic with each agent and then connects the other agents' information to the main agent's batch,
which is then sent to the loss computation.
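As a rough, hypothetical sketch of that idea, mirroring RLlib's postprocessing signature, where other_agent_batches is None during the early steps. The per-agent keys and the exact tuple layout of other_agent_batches can differ across Ray versions, and this is not the repository's real function:

```python
# Illustrative sketch only: check for other agents' batches and, when they
# exist, attach their data to the main agent's batch for the loss to use.
def connect_and_share_sketch(policy, sample_batch, other_agent_batches=None, episode=None):
    if other_agent_batches is None:
        # Early "warm-up" steps: only one agent has been sampled, nothing to attach.
        return sample_batch
    for agent_id, (other_policy, other_batch) in other_agent_batches.items():
        # Hypothetical per-agent keys; in practice the critic would also be
        # shared between `policy` and `other_policy` at this point.
        sample_batch[f"{agent_id}_obs"] = other_batch["obs"]
        sample_batch[f"{agent_id}_actions"] = other_batch["actions"]
    return sample_batch
```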
In short, in order to solve this,
you need to:
Firstly, make sure that there is more than one agent in this task;
Secondly, wait a short while, and more agents' information will appear.
Am I clear about it?
Awesome! Thanks for the clarification.
You are welcome. It's my pleasure. :)
After testing HAPPO, I found that in happo_surrogate_loss, no other agents are considered for each self-agent. I wonder if there is any problem?