Hi, as you mentioned, we set the maximum possible entropy as the target entropy. We do this to ensure sufficient exploration in each task, and we indeed find that this target entropy achieves good results in most tasks. However, in some tasks a target entropy that is too high may push the policy toward a uniform distribution, resulting in suboptimal performance. In such cases, the target entropy may need to be tuned further.
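For concreteness, here is a minimal sketch of the kind of SAC-style automatic temperature update being discussed, assuming a discrete 3-action space so that the maximum possible entropy is log(3). The variable names, the PyTorch framing, and the learning rate are illustrative only and are not taken verbatim from the HASAC/HARL code:

```python
import torch

# Illustrative setup: a discrete action space with 3 actions, as in the matrix-game example.
num_actions = 3

# "Maximum possible entropy" as the target: the entropy of the uniform distribution, log|A|.
target_entropy = torch.log(torch.tensor(float(num_actions)))

# Optimize log(alpha) so that alpha itself stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs):
    """One SAC-style temperature update, given log pi(a|s) of sampled actions.

    Gradient descent on this loss moves log_alpha in the direction of
    (target_entropy - H(pi)): alpha grows while the policy entropy is below
    the target and only stops changing when the two match.
    """
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()

# A nearly deterministic policy has low entropy, so alpha increases after the update.
probs = torch.tensor([0.01, 0.01, 0.98])
actions = torch.multinomial(probs, num_samples=256, replacement=True)
print(update_alpha(torch.log(probs[actions])))  # slightly above the initial alpha = 1.0
```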
Thanks for your reply!
I'm still a little confused. Take the single-state, 2-agent cooperative matrix game in the HASAC paper as an example. When the policy converges to the optimal policy, i.e., [0, 0, 1], its entropy becomes very small, so the auto-alpha mechanism increases alpha and pushes the policy back toward the uniform policy, after which the policy converges to [0, 0, 1] again. As a result, the algorithm might oscillate and fail to converge. I wonder whether this happens in the tasks where auto-alpha performs well, or whether I am simply misinterpreting the mechanism.
Looking forward to your reply!
Yes, the situation you described is a case where auto-alpha performs well. Initially, the entropy term pushes the policy toward a uniform distribution, so it explores sufficiently. Once the optimal policy, such as [0, 0, 1], is discovered, the policy then converges toward it.
Thank you for your response. Perhaps I still don't understand your meaning; please forgive me. I wonder whether, when the algorithm converges to the optimal policy, such as [0, 0, 1], the low entropy of that policy will greatly increase the value of alpha, causing the algorithm to bounce back from the optimal policy to the uniform policy.
Sorry for the late reply. Once the policy has converged to an optimal policy such as [0, 0, 1], it will not return to the uniform distribution. This is because the converged policy is already the result of the algorithm balancing reward maximization and entropy maximization under the current alpha value. Please note that different alpha values only let the algorithm maximize reward while maintaining as much exploration as possible.
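One small way to see this point: for a single state and a fixed alpha, the policy that maximizes E_pi[Q] + alpha * H(pi) is exactly softmax(Q / alpha), so the converged policy is a stable optimum that already trades reward off against entropy rather than something that snaps back to uniform. The Q values below are hypothetical (not the payoff matrix from the paper), and this is a one-state simplification, not the actual HASAC update:

```python
import numpy as np

def soft_optimal_policy(q_values, alpha):
    """Policy maximizing E_pi[Q] + alpha * H(pi) for a single state: softmax(Q / alpha)."""
    logits = q_values / alpha
    logits = logits - logits.max()        # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Hypothetical per-action values for one agent's 3 actions (not the paper's payoff matrix).
q = np.array([0.0, 0.0, 5.0])

for alpha in [5.0, 1.0, 0.1]:
    pi = soft_optimal_policy(q, alpha)
    print(f"alpha={alpha:>4}: pi={np.round(pi, 3)}, entropy={entropy(pi):.3f}")

# Larger alpha -> higher-entropy optimum (more exploration); smaller alpha -> close to [0, 0, 1].
# For any fixed alpha, this softmax policy is the stable optimum: it does not "bounce back"
# toward uniform, because it already balances reward against entropy.
```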
Thank you very much!
Hi, I have a question about the auto-alpha.
I noticed that you set target_entropy to the maximum possible value in the code, which seems to make alpha increase whenever the policy's entropy drops. I therefore wonder whether this causes the algorithm to bounce back to the uniform policy every time it converges to some better policy?
Looking forward to your reply!
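To make the observation in this question quantitative: with the commonly used alpha loss, an expected plain-SGD step on log(alpha) is proportional to target_entropy - H(pi) (Adam rescales the step but keeps its sign). The action count and example policies below are made up for illustration and are not from the HASAC code:

```python
import numpy as np

num_actions = 3
target_entropy = np.log(num_actions)   # maximum possible entropy, ~1.099 nats for 3 actions

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

# With the usual alpha loss, an expected plain-SGD step on log(alpha) is roughly
# lr * (target_entropy - H(pi)); the "push" below is that step's direction and relative size.
for name, pi in [("near-uniform", [0.34, 0.33, 0.33]),
                 ("better policy", [0.10, 0.10, 0.80]),
                 ("near-optimal", [0.01, 0.01, 0.98])]:
    push = target_entropy - entropy(pi)
    print(f"{name:>13}: H(pi)={entropy(pi):.3f}, push on log(alpha)={push:.3f}")

# Because no policy can exceed the maximum entropy, the push is never negative:
# alpha keeps growing for as long as the policy is more deterministic than uniform.
```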