PKU-MARL / HARL

Official implementation of HARL algorithms based on PyTorch.

Modifying Continuous Action Output with Softmax in MAPPO/HAPPO #47

Open georgewanglz2019 opened 2 months ago

georgewanglz2019 commented 2 months ago

Hello,

Thank you for sharing your code, it has been incredibly useful!

I am currently trying to use your MAPPO or HAPPO implementation on my tasks, where the actions are n-dimensional continuous actions. These actions need to sum to 1, with each component greater than or equal to 0 and less than or equal to 1.

To achieve this, I modified your continuous action code by adding a softmax function in the last layer. Specifically, in the forward function of act.py, I added a line after the continuous action code as follows:

```python
actions = (
    action_distribution.mode()
    if deterministic
    else action_distribution.sample()
)
# Added line:
actions = torch.softmax(actions, dim=-1)
```

However, after training for several epochs, the neural network seems to be learning incorrectly, often outputting a large number of NaNs, which causes the program to terminate. I researched online and found that it might be due to gradient explosion or some other issue. Changing the activation function to tanh didn't help much, as it just delayed the occurrence of NaNs by a few more epochs.

Based on the above problem, I would like to seek your advice on how to modify the code to achieve the desired continuous actions. If you are interested, I would greatly appreciate your time and assistance or any ideas you could share.

Thank you very much for your help!

Best regards, George Wang

Ivan-Zhong commented 2 months ago

Hi, thank you for appreciating our work. As for your question, have you modified the calculation of action log probability accordingly?

WangJinCheng1998 commented 2 months ago

Hi! I have the same problem. Have you solved it?

Also, I tried directly clipping the actions in the environment, but the performance is very bad. Would you mind providing the solution code for that? Thank you so much.

georgewanglz2019 commented 2 months ago

> Hi, thank you for appreciating our work. As for your question, have you modified the calculation of action log probability accordingly?

Thanks for your reply! No, I only added softmax() in the last layer of act.py (in the forward function). I'm not sure whether I need to modify the calculation of the action log probability, or how to do so, because after the softmax, act_out is still drawn from a normal distribution. I need n-dimensional continuous actions. Could you please tell me how to modify the code?

georgewanglz2019 commented 2 months ago

> Hi! I have the same problem. Have you solved it?
>
> Also, I tried directly clipping the actions in the environment, but the performance is very bad. Would you mind providing the solution code for that? Thank you so much.

No, the problem is not solved. But if I only apply a softmax and clipping in my environment (in the step function) and don't modify anything in the MAPPO code, it works. However, the result is not good enough for me.
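In case it helps, what I do on the environment side looks roughly like the following. This is only a minimal sketch, assuming a NumPy-based step(); `to_simplex` is an illustrative helper name, not part of HARL:

```python
import numpy as np

def to_simplex(raw_action: np.ndarray) -> np.ndarray:
    """Map the raw continuous action onto the probability simplex inside env.step():
    clip to a bounded range, then softmax-normalize."""
    clipped = np.clip(raw_action, -10.0, 10.0)   # guard against extreme values
    z = clipped - clipped.max()                  # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()                   # components in [0, 1], summing to 1
```

This keeps the MAPPO/HAPPO code untouched, since the policy still outputs an unconstrained Gaussian action; the projection only happens inside the environment.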

Ivan-Zhong commented 1 month ago

Well, what you want is a distribution over n-dimensional continuous actions that lie within [0, 1] and sum up to 1. Currently, you add a softmax to the output of a normal distribution. While this makes the actions satisfy your requirement, the log probability of the actions cannot be calculated as before, since you need to account for the effect of the softmax. A similar case has been discussed and addressed in the SAC paper (Appendix C). You may want to take a look.
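For reference, the correction used there for a tanh-squashed Gaussian policy is the change-of-variables formula

$$\log \pi(a \mid s) = \log \mu(u \mid s) - \sum_{i=1}^{D} \log\left(1 - \tanh^2(u_i)\right), \qquad a = \tanh(u),$$

where $\mu(\cdot \mid s)$ is the pre-squashing Gaussian. An analogous Jacobian term would be needed for the softmax, which is more awkward because softmax maps $\mathbb{R}^n$ onto the $(n-1)$-dimensional simplex and is not invertible, so the standard change-of-variables formula does not apply directly.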

An alternative approach, which I think is simpler and more natural, is to use the Dirichlet distribution for the action output. Its support is

$$x_1, x_2, \ldots, x_n \in [0, 1], \qquad \sum_{i=1}^n x_i = 1.$$

And its log probability can be computed directly. In practice, the agent should learn suitable concentration parameters for the Dirichlet distribution in each state and then sample actions from it.
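A minimal sketch of what such a Dirichlet action head could look like, using torch.distributions.Dirichlet (the class and parameter names below are only illustrative, not existing HARL code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Dirichlet


class DirichletActionHead(nn.Module):
    """Maps actor features to a Dirichlet distribution over the action simplex."""

    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, action_dim)

    def forward(self, features: torch.Tensor, deterministic: bool = False):
        # Softplus keeps the concentration parameters strictly positive;
        # the +1 offset keeps them >= 1, which avoids densities that blow up
        # at the simplex corners (a common source of NaNs early in training).
        concentration = F.softplus(self.fc(features)) + 1.0
        dist = Dirichlet(concentration)
        actions = dist.mean if deterministic else dist.sample()
        # Exact log-density on the simplex: no squashing correction is needed,
        # because the actions are sampled directly from this distribution.
        log_probs = dist.log_prob(actions)
        return actions, log_probs
```

During the PPO/HAPPO update, the log probabilities of the stored actions can be re-evaluated with the same distribution, so the importance ratios stay consistent.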

These are my thoughts on this problem. I have not encountered this requirement in my experiment scenarios, so I have no experience. I hope this will work and am looking forward to your testing results.

georgewanglz2019 commented 1 month ago

> An alternative approach, which I think is simpler and more natural, is to use the Dirichlet distribution for the action output. [...] In practice, the agent should learn suitable concentration parameters for the Dirichlet distribution in each state and then sample actions from it.

Thank you very much for your detailed explanation and suggestions!

I will look into the SAC paper to understand how they handle the log probability calculation with the softmax. Also, I will learn about the Dirichlet distribution and how to implement that. I really appreciate your help and will keep you posted on my progress. If you have any further insights or suggestions, they are always welcome.

Thanks again!

Ivan-Zhong commented 1 month ago

Sure, you're welcome. :)

georgewanglz2019 commented 3 weeks ago

> An alternative approach, which I think is simpler and more natural, is to use the Dirichlet distribution for the action output. [...] I hope this will work and am looking forward to your testing results.

Update: The Dirichlet distribution worked well for me, and I achieved better performance in IPPO! I plan to test further and will share more results soon. Thank you for the suggestion!