Open vinerich opened 3 years ago
I'm glad to hear you made it
This line is based on paper2, page 6, but I didn't remember the mathematical theory behind it very well. I'll try to write down an explanation to refresh my memory :D
I've been re-reading the paper and it gradually came back to me. You can get the theory and concepts from the paper itself, but I'll write a brief summary below.
First of all: of course we want to optimize the policy (not the target policy) by computing gradients through μ and A. Then we would simply write `MultivariateNormal(loc=μ, scale_tril=A)` at that line, as described in eq. (4) of the paper. But the paper says this leads to premature convergence: it is natural for A to converge to a small value so that the sampled action reliably maximizes reward, and this can happen in the early stages of training. To avoid this, we learn μ and A separately. The term `MultivariateNormal(loc=b_μ, scale_tril=A)` holds the mean fixed at the old policy's `b_μ` and learns only A to get better reward, which preserves a relatively large A.
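A minimal sketch of this decoupled construction in PyTorch, with dummy tensors standing in for the policy network's outputs (the names μ, A, b_μ, b_A follow the thread's notation; real code would produce them from state batches):

```python
import torch
from torch.distributions import MultivariateNormal

# Learnable outputs of the current policy (dummy values for illustration)
μ = torch.zeros(2, requires_grad=True)   # current policy mean
A = torch.eye(2, requires_grad=True)     # current Cholesky factor of the covariance

# Fixed outputs of the old (behaviour) policy — no gradients flow into these
b_μ = torch.ones(2)
b_A = 0.5 * torch.eye(2)

action = torch.tensor([0.1, -0.2])

# Coupled form (paper eq. 4): gradients would reach μ and A through a single
# distribution, which tends to shrink A prematurely:
#   MultivariateNormal(loc=μ, scale_tril=A)

# Decoupled form: the first term holds the mean at b_μ so only A learns;
# the second holds the covariance at b_A so only μ learns.
logp_A = MultivariateNormal(loc=b_μ, scale_tril=A).log_prob(action)
logp_μ = MultivariateNormal(loc=μ, scale_tril=b_A).log_prob(action)

(-(logp_A + logp_μ)).backward()
print(μ.grad is not None, A.grad is not None)  # both get gradients, via separate terms
```

Because each of the two log-probability terms contains only one learnable parameter, A can stay relatively large even while μ is pulled toward high-reward actions.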
Thank you!
Thanks for the thorough explanation! Makes total sense to me, and the paper explains it well too. I had only looked at Paper1, as it seems 😄
A further question is about the order of magnitude of `loss_p` (I honestly didn't know the English word, but it's about whether it's around 10e1, 10e2, etc.).
This is an excerpt from my current training:
`actor_policy_loss` is `loss_p`; `max_kl_mean` and `max_kl_std` are `η_kl_μ` and `η_kl_Σ`, respectively.
This is at the later stages of training, but they really don't seem to have much impact — also because your default `kl_mean_constraint` sits at 0.01, "negating" every effort by the KL.
What am I getting wrong here?
Thanks again!! Your explanations really help to wrap my head around this algorithm.
Sorry, I need to get the information organized first.
`actor_loss` is `-(actor_policy_loss + max_kl_mean * (ε_kl_μ - kl_μ) + max_kl_std * (ε_kl_Σ - kl_Σ))`, like this line?
Please note that `max_kl_μ = max(history of kl_μ)`, like this line in my code; `max_kl_Σ` works the same way.
So I have some confusion: does your `max_kl_mean` mean `max_kl_mean = max(history of kl_μ)` or `max_kl_mean = η_kl_μ`?
By the way, it can happen that `kl_μ` and `kl_Σ` (line) have no impact. If `kl_μ` is less than `self.ε_kl_μ`, then `self.η_kl_μ` goes to zero (line); if `kl_μ` becomes greater than `self.ε_kl_μ`, we learn to decrease `kl_μ`. So eventually `kl_μ` and `η_kl_μ` should have no impact on `loss_l` (line). `kl_Σ` and `η_kl_Σ` behave the same way. These specifications correspond to pp. 5–6 of the paper.
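The η dynamics described above can be sketched as a dual-ascent step followed by projection onto [0, ∞). The function name, learning rate, and loop below are illustrative, not taken from the repo:

```python
def update_eta(eta, kl, eps, lr=0.1):
    """One dual-ascent step on a Lagrange multiplier η for the constraint kl ≤ eps."""
    # η grows while the constraint is violated (kl > eps) and decays toward
    # zero while it is satisfied (kl < eps); clamping at 0 keeps η ≥ 0.
    return max(0.0, eta + lr * (kl - eps))

# While kl stays below eps, repeated updates drive η to exactly zero, so the
# penalty term η * (eps - kl) stops influencing the actor loss.
eta = 0.5
for _ in range(2000):
    eta = update_eta(eta, kl=0.005, eps=0.01)
print(eta)  # 0.0

# Conversely, a constraint violation immediately pushes η back above zero.
print(update_eta(0.0, kl=0.02, eps=0.01))
```

This is why a satisfied KL constraint "zeroes out" its term in the loss: the multiplier itself vanishes, not the KL value.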
Thanks!
Sorry for the bad wording there — I'll try to clean that up!
```
actor_loss = -(
    actor_loss_p
    + self.η_kl_μ * (self.ε_kl_μ - kl_μ)
    + self.η_kl_Σ * (self.ε_kl_Σ - kl_Σ)
)

max_kl_mean  = max(history of kl_μ)
max_kl_std   = max(history of kl_Σ)
mean_kl_mean = mean(history of kl_μ)
mean_kl_std  = mean(history of kl_Σ)
```
> By the way, it can happen that `kl_μ` and `kl_Σ` (line) have no impact. If `kl_μ` is less than `self.ε_kl_μ`, then `self.η_kl_μ` goes to zero (line); if `kl_μ` becomes greater than `self.ε_kl_μ`, we learn to decrease `kl_μ`. So eventually `kl_μ` and `η_kl_μ` should have no impact on `loss_l` (line). `kl_Σ` and `η_kl_Σ` behave the same way. These specifications correspond to pp. 5–6 of the paper.
These lines answered my question exactly. I was wondering whether the magnitudes of the parameters `ε_kl_μ` and `ε_kl_Σ` are correct, since they effectively zero out the impact of `kl_μ` and `kl_Σ`.
Thanks for the pointer to the paper! I'm gradually starting to understand the inner workings of this algorithm 👍🏾
Hey Dai,
On lines 291:298 of mpo.py you compute a `loss_p`. Could you explain why the parameters in the construction of the distributions are switched? In my understanding this should not be the case, since we are interested in the real probabilities and not some sort of mixed ones.
Could you further explain what loss exactly is computed there, since it is not present in the repo your implementation is taken from.
My background on the MPO mathematics is practically nonexistent, and I sadly can't wrap my head around the equations in the original paper.
Thanks in advance!