Replicable-MARL / MARLlib

One repository is all that is necessary for Multi-agent Reinforcement Learning (MARL)
https://marllib.readthedocs.io
MIT License

Problem on implementation of HAPPO #66

Closed. Wangjw6 closed this issue 2 years ago

Wangjw6 commented 2 years ago

After testing HAPPO, I found that in happo_surrogate_loss no other agents are considered for each self-agent. Is this a problem?

mrvgao commented 2 years ago

Hello. Ray RLlib is designed around singleton (per-policy) updating, so to implement the heterogeneous sequential update we cannot express everything inside the loss function directly.

Here is the implementation:

First, in https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/core/CC/happo.py#L74

    # Loop over each agent's training info: model, distribution class, batch,
    # mask, reduce-mean function, actions, policy, and previous action log-probs.
    for i, iter_train_info in enumerate(get_each_agent_train(model, policy, dist_class, train_batch)):
        iter_model, iter_dist_class, iter_train_batch, iter_mask, \
            iter_reduce_mean, iter_actions, iter_policy, iter_prev_action_logp = iter_train_info

This loop iterates over each agent's model, batch, and related data. These data were collected during postprocessing.

Second, each agent's model is trained in https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/core/CC/happo.py#L114

Third, we use each agent's freshly updated model to recompute its importance-sampling ratio, which updates the M-advantage used by the agents trained after it.
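
As a conceptual illustration of steps two and three, here is a hedged sketch of the sequential update with the accumulated importance ratio. It is not the RLlib-based code in happo.py: the agent interface (log_prob, log_prob_old, optimizer) and the batch layout are hypothetical and only show the idea.

    import torch

    def sequential_happo_update(agents, batch, clip_param=0.2):
        """Conceptual sketch of HAPPO's sequential update, not MARLlib's code.

        Hypothetical agent interface:
          agent.log_prob(batch)     -> log-probs under the current parameters
          agent.log_prob_old(batch) -> log-probs under the behaviour policy
          agent.optimizer           -> torch optimizer over the agent's actor
        batch["advantages"] is assumed to be a torch tensor of advantages.
        """
        m_advantage = batch["advantages"]  # A(s, a) from the (centralized) critic
        for agent in agents:  # sequential update; the order can be randomized
            ratio = torch.exp(agent.log_prob(batch) - agent.log_prob_old(batch))
            surrogate = torch.min(
                ratio * m_advantage,
                torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * m_advantage,
            )
            loss = -surrogate.mean()
            agent.optimizer.zero_grad()
            loss.backward()
            agent.optimizer.step()
            # Fold this agent's updated ratio into the advantage so that the
            # agents updated later optimize against its new behaviour.
            with torch.no_grad():
                updated_ratio = torch.exp(agent.log_prob(batch) - agent.log_prob_old(batch))
                m_advantage = updated_ratio * m_advantage

The key point is that each agent's clipped surrogate is computed against an advantage that already carries the ratios of the agents updated before it.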

Fourth, to obtain each agent's model results, an important class needs to be mentioned:

https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/utils/heterogeneous_updateing.py#L54

Because of this class, we can access each agent's data by overriding the built-in __getitem__ method.
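
As a rough illustration of that pattern (not the actual class in heterogeneous_updateing.py; the prefixed key scheme is hypothetical), overriding __getitem__ lets a thin wrapper route ordinary key lookups to one agent's slice of a combined batch:

    class PerAgentBatchView:
        """Minimal illustration of the __getitem__ pattern, not MARLlib's class."""

        def __init__(self, train_batch, agent_prefix):
            self._batch = train_batch      # combined batch holding all agents' fields
            self._prefix = agent_prefix    # e.g. a hypothetical "agent_0" prefix

        def __getitem__(self, key):
            prefixed = self._prefix + "_" + key
            # Prefer this agent's own copy of the field; fall back to the
            # shared (un-prefixed) field if no per-agent copy exists.
            if prefixed in self._batch:
                return self._batch[prefixed]
            return self._batch[key]

        def __contains__(self, key):
            return (self._prefix + "_" + key) in self._batch or key in self._batch

    # The loss code can keep writing view["obs"]; each view resolves the lookup
    # to its own agent's data.
    combined = {"obs": "shared_obs", "agent_0_obs": "obs_of_agent_0"}
    view = PerAgentBatchView(combined, "agent_0")
    assert view["obs"] == "obs_of_agent_0"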

By combining the pieces above, we update the related agents sequentially, with each agent's update depending on the agents updated before it.

Am I clear on this?

Wangjw6 commented 2 years ago

Thanks for the response. I basically understand the logic. What confused me is this: I ran the command python main.py --algo_config=happo --env_config=lbf and printed all_polices_with_names in https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/utils/heterogeneous_updateing.py#L54:~:text=get_each_agent_train, and it turned out that only the self policy was in all_polices_with_names. What, then, is the point of iterating over all_polices_with_names?

mrvgao commented 2 years ago

Hello. This is actually a behavior that exists in Ray itself, and it is pretty tricky.

Here is the thing:

Suppose there are N agents in a task.

During the first few sampling steps, RLlib will run ONLY ONE agent when sampling.

But after some steps, as a kind of warm-up, all agents will start to appear.

For example, if we check the postprocessing function here:

https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/utils/centralized_critic_hetero.py#L301

We will find that in the beginning steps the crucial parameter other_agent_batches is None, which means RLlib is running only a single agent.

To solve this problem, I implemented the connect-and-share function here:

https://github.com/Replicable-MARL/MARLlib/blob/bb3b9a296a7792486d4cceb6e3175a8140ba4db1/marl/algos/utils/centralized_critic_hetero.py#L201

This function checks whether other_agent_batches is present for the sample batch. If it is, it first shares the critic with each agent and then attaches the other agents' information to the main agent's batch, which is what is sent to the loss computation.
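
For reference, here is a minimal sketch of that pattern. It is not the actual function in centralized_critic_hetero.py: the opponent_* key names are hypothetical, and the (policy, SampleBatch) layout of other_agent_batches is assumed from the Ray 1.x API.

    import numpy as np

    def postprocess_with_other_agents(policy, sample_batch, other_agent_batches=None, episode=None):
        """Sketch of an RLlib-style postprocessing hook that attaches other
        agents' data to the main agent's batch once it becomes available.
        Not MARLlib's actual implementation.
        """
        if other_agent_batches:
            # Other agents have been sampled: copy their observations and
            # actions into this agent's batch so the loss can see them.
            for agent_id, (other_policy, other_batch) in other_agent_batches.items():
                sample_batch[f"opponent_{agent_id}_obs"] = other_batch["obs"].copy()
                sample_batch[f"opponent_{agent_id}_actions"] = other_batch["actions"].copy()
        else:
            # In the very first sampling steps only a single agent shows up and
            # other_agent_batches is None; a real implementation would fill in
            # zero placeholders here so the batch keeps a consistent key set.
            pass
        return sample_batch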

In short, to resolve this you need to:

First, make sure there is more than one agent in the task;

Second, wait a short while; after a few more sampling steps the other agents' information will appear.

Am I clear about it?

Wangjw6 commented 2 years ago

Awesome! Thanks for the clarification.

mrvgao commented 2 years ago

You are welcome. It's my pleasure. :)