daisatojp / mpo

PyTorch Implementation of the Maximum a Posteriori Policy Optimisation
GNU General Public License v3.0

Question: on loss_p calculation #7

Open vinerich opened 3 years ago

vinerich commented 3 years ago

Hey Dai,

On lines 291:298 of mpo.py you compute loss_p. Could you explain why the parameters in the construction of the distributions are switched?

In my understanding this should not be the case, as we are interested in the real probabilities and not a mix of current and target parameters.

Could you further explain what loss exactly is computed there, since it is not present in the repo your implementation is based on?

My background on the mathematical side of MPO is pretty much non-existent, and I sadly can't wrap my head around the equations in the original paper..

Thanks in advance!

daisatojp commented 3 years ago

I'm glad to hear you made it!

This line is based on paper2, page 6. But I don't remember the mathematical theory behind it very well. I think I'll try to write down an explanation to refresh my memory :D

daisatojp commented 3 years ago

I've been reading the paper and gradually came to remember it. I think you can understand the theory and concepts by reading the paper, but I'll write a brief summary as follows.

First of all:

Of course we want to optimize the policy (not the target policy) by calculating gradients through μ and A. Then we would simply write MultivariateNormal(loc=μ, scale_tril=A) at that line, as described in eq. (4) of the paper. But the paper says this leads to premature convergence: it is natural for A to converge to a small value so as to reliably pick the action that maximizes reward, and this can happen in the early stages of training. To avoid this, we learn μ and A separately. The term MultivariateNormal(loc=b_μ, scale_tril=A) learns only A to get better reward, which preserves a relatively high value of A.
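
To make the decoupling concrete, here is a minimal, self-contained sketch of that construction (the names μ, A, b_μ, b_A, actions, and qij are illustrative stand-ins for the actor outputs, the frozen target-actor outputs, the sampled actions, and the non-parametric E-step weights, not the exact variables in mpo.py):

import torch
from torch.distributions import MultivariateNormal

da = 2                                   # action dimension (illustrative)
μ = torch.zeros(da, requires_grad=True)  # mean from the current policy network
A = torch.eye(da, requires_grad=True)    # Cholesky factor from the current policy network
b_μ = torch.zeros(da)                    # mean from the frozen target policy
b_A = torch.eye(da)                      # Cholesky factor from the frozen target policy

actions = torch.randn(5, da)             # actions sampled from the target policy
qij = torch.softmax(torch.randn(5), 0)   # non-parametric weights from the E-step

# Each term mixes current and target parameters, so gradients reach μ only
# through the first distribution and A only through the second.
π1 = MultivariateNormal(loc=μ, scale_tril=b_A)  # updates μ, A held at the target value
π2 = MultivariateNormal(loc=b_μ, scale_tril=A)  # updates A, μ held at the target value
loss_p = torch.mean(qij * (π1.log_prob(actions) + π2.log_prob(actions)))

The second term still pushes A toward covariances that explain the high-weight actions, but because its mean is the slower-moving target mean b_μ, A is not forced to collapse early just to concentrate probability on a single action.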

Thank you!

vinerich commented 3 years ago

Thanks for the thorough explanation! It makes total sense to me, and the paper explains it well too. It seems I had only looked at paper1 😄

My further question relates to the order of magnitude of loss_p. (I honestly didn't know the English term, but it's about whether it's 10e1, 10e2, etc. ...)

This is an excerpt from my current training: [screenshot of training metrics]

actor_policy_loss is loss_p; max_kl_mean and max_kl_std are η_kl_μ and η_kl_Σ respectively. This is from the later stages of training, but they really don't seem to have much impact, also because your default kl_mean_constraint sits at 0.01, "negating" every effort by the KL terms.

What am I getting wrong here?

Thanks again!! Your explanations really help me wrap my head around this algorithm.

daisatojp commented 3 years ago

Sorry, I need some more organized information. Is your actor_loss computed as -(actor_policy_loss + max_kl_mean * (ε_kl_μ - kl_μ) + max_kl_std * (ε_kl_Σ - kl_Σ)), like this line? Please note that max_kl_μ = max(history of kl_μ), like this line in my code; max_kl_Σ is the same. So I have some confusion: does your max_kl_mean mean max_kl_mean = max(history of kl_μ), or max_kl_mean = η_kl_μ?

By the way, it can happen that kl_μ and kl_Σ (line) have no impact, because if kl_μ is less than self.ε_kl_μ, self.η_kl_μ goes to zero (line). If kl_μ becomes greater than self.ε_kl_μ, we learn to decrease kl_μ. So eventually kl_μ and η_kl_μ should have no impact on loss_l (line). kl_Σ and η_kl_Σ behave the same way. These specifications correspond to the paper, pp. 5-6.
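
For intuition, here is a minimal sketch of that projected dual update for one multiplier (the function name, step size α, and the concrete numbers are assumptions for illustration, not the exact code):

def update_η(η, kl, ε, α=0.01):
    # Gradient step on the Lagrange multiplier: if the constraint is
    # satisfied (kl < ε), η shrinks toward zero and the penalty term
    # η * (ε - kl) stops influencing the actor loss; if kl > ε, η grows
    # and pushes the policy to reduce the KL again.
    η = η - α * (ε - kl)
    return max(0.0, η)  # project back onto η ≥ 0

η_kl_μ = update_η(η=0.05, kl=0.005, ε=0.01)  # constraint satisfied, so η decays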

Thanks!

vinerich commented 3 years ago

Sorry for the bad wording there, I'll try to clean that up!

actor_loss = -(
    actor_policy_loss
    + self.η_kl_μ * (self.ε_kl_μ - kl_μ)
    + self.η_kl_Σ * (self.ε_kl_Σ - kl_Σ)
)

max_kl_mean  = max(history of kl_μ)
max_kl_std   = max(history of kl_Σ)

mean_kl_mean = mean(history of kl_μ)
mean_kl_std  = mean(history of kl_Σ)

> By the way, it can happen that kl_μ and kl_Σ (line) have no impact, because if kl_μ is less than self.ε_kl_μ, self.η_kl_μ goes to zero (line). If kl_μ becomes greater than self.ε_kl_μ, we learn to decrease kl_μ. So eventually kl_μ and η_kl_μ should have no impact on loss_l (line). kl_Σ and η_kl_Σ behave the same way. These specifications correspond to the paper, pp. 5-6.

These lines answered exactly my question. I was wondering whether the magnitudes of the parameters ε_kl_μ and ε_kl_Σ are correct, since they effectively zero out the impact of kl_μ and kl_Σ.

Thanks for the hint to the paper! I'm gradually starting to understand the inner workings of this algorithm 👍🏾