Hi. HASAC is an instance of the MEHAML template with a drift functional of 0 and a neighborhood operator equal to the whole policy space $\Pi$. Therefore, the policy update of HASAC satisfies Equation 28 in Lemma G.2, and Lemma G.2 in turn guarantees that the resulting policies satisfy the condition of Equation 25 in Lemma G.1. I hope my answer clears up your confusion.
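For reference, here is a rough sketch of the reduction described above; the symbols below ($\mathfrak{D}$ for the drift functional, $\mathcal{U}^{i_m}$ for the neighborhood operator, $A^{i_m}_{\text{soft}}$ for the maximum-entropy advantage of agent $i_m$) are placeholders standing in for the paper's notation, not its exact statement. A MEHAML-style update for agent $i_m$ has the form

$$
\pi^{i_m}_{\text{new}} \in \arg\max_{\pi^{i_m} \in \,\mathcal{U}^{i_m}(\pi_{\text{old}})} \; \mathbb{E}\!\left[ A^{i_m}_{\text{soft}}(s, a) \right] \;-\; \mathfrak{D}^{i_m}\!\left(\pi^{i_m} \,\middle\|\, \pi^{i_m}_{\text{old}}\right).
$$

Choosing $\mathfrak{D} \equiv 0$ and $\mathcal{U}^{i_m}(\pi_{\text{old}}) = \Pi^{i_m}$ removes both the penalty term and the constraint, so the update collapses to

$$
\pi^{i_m}_{\text{new}} \in \arg\max_{\pi^{i_m} \in \,\Pi^{i_m}} \; \mathbb{E}\!\left[ A^{i_m}_{\text{soft}}(s, a) \right],
$$

i.e., unconstrained maximization of the soft advantage, which is exactly the SAC-style actor step HASAC performs. That is why the HASAC update satisfies the condition of Equation 28 by construction.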
The mathematical proof in the paper is very detailed, but there is still one point I am not clear about. The goal of the actor network in each state is to maximize the soft Q-function plus the expected future entropy. How, though, is the assumption of Equation 25 in Lemma G.1 guaranteed during the actor network's update (see the sketch below for what I understand the actor objective to be)? I am very confused about this point and look forward to your answer.
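To make the objective being asked about concrete, here is a minimal, hypothetical PyTorch-style sketch of a SAC-style actor loss; names such as `actor`, `critic`, and `alpha` are placeholders for illustration, not the repository's actual API.

```python
import torch

def soft_actor_loss(actor, critic, obs, alpha=0.2):
    """Hypothetical sketch of a SAC-style actor objective.

    The actor is trained to maximize E[Q_soft(s, a) - alpha * log pi(a|s)],
    i.e. the soft Q-value plus an entropy bonus, implemented by minimizing
    the negated expression returned below.
    """
    # Reparameterized sample so gradients flow through the sampled action.
    action, log_prob = actor.sample(obs)   # log_prob: log pi(a|s)
    q_value = critic(obs, action)          # soft Q estimate Q_soft(s, a)
    # Minimizing (alpha * log_prob - q_value) maximizes Q_soft + alpha * entropy.
    return (alpha * log_prob - q_value).mean()
```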