Closed Darthnash closed 6 years ago
Hi @Darthnash,
The entropy in continuous control models is a function of the `log_sigma_sq` variable. It corresponds to the spread of the Gaussian distribution: larger entropy means larger spread, and as such more variation in sampled action values. During training we use an entropy regularizer in the loss to encourage policies with greater spread in their distribution, which corresponds to encouraging the agent to explore the action space. I hope that answers your question.
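The relationship between `log_sigma_sq`, spread, and entropy can be sketched numerically. Here is a minimal numpy illustration (not the repo's actual TensorFlow code; `beta` in the comment is a hypothetical hyperparameter name for the regularizer weight):

```python
import numpy as np

def gaussian_entropy(log_sigma_sq):
    """Differential entropy of a diagonal Gaussian:
    H = 0.5 * (log(2*pi*e) + log(sigma^2)), summed over action dims."""
    return np.sum(0.5 * (np.log(2.0 * np.pi * np.e) + log_sigma_sq))

# Larger log_sigma_sq -> wider distribution -> higher entropy.
narrow = gaussian_entropy(np.array([np.log(0.1)]))
wide = gaussian_entropy(np.array([np.log(2.0)]))
assert wide > narrow

# During training the entropy typically enters the loss as a bonus term,
# e.g. loss = policy_loss - beta * entropy, so minimizing the loss pushes
# the entropy (i.e. the spread) up, encouraging exploration.
```

Since the entropy is a differentiable function of `log_sigma_sq`, the regularizer's gradient flows directly into that variable during the update.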
Hi @awjuliani,
thanks a lot for your answer!
I think I got it now. So far I thought that the entropy was calculated after the policy was already found, in order to rate the new policy. But instead `sigma_sq` is a parameter that is modified during optimization to find the new policy, in a way that changes which actions are chosen. Is that more or less correct?
Just one more question: is there a reason that `log_sigma_sq` is defined as a variable rather than `sigma_sq` directly?
Hi @Darthnash,
`sigma_sq` is used both to select actions and to measure entropy. During the update phase it is adjusted both to increase the reward and to increase the entropy. The log is typically used because it has nicer mathematical properties for optimization and for representation by a neural network.
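A small numpy sketch of why the log parameterization is convenient (illustrative only; the variable names mirror the discussion, not the repo's code):

```python
import numpy as np

# Parameterizing the variance via its log keeps sigma^2 strictly positive
# for ANY real value the optimizer produces, so no constraint handling
# (clipping, projection) is needed during gradient descent.
log_sigma_sq = np.array([-3.0, 0.0, 4.0])  # unconstrained parameter values
sigma_sq = np.exp(log_sigma_sq)            # always > 0
assert np.all(sigma_sq > 0)

# A unit step in log space is a multiplicative change in sigma^2, which
# behaves better across scales than additive updates to sigma_sq itself.
assert np.isclose(np.exp(0.0 + 1.0) / np.exp(0.0), np.e)
```

If `sigma_sq` were the raw variable, a gradient step could push it to zero or below, making the Gaussian density undefined.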
Thank you @awjuliani for your quick response.
Is the updated sigma then independent of the state that the agent is actually in?
I'm sorry that I keep bothering you, but there is another thing that I can't wrap my head around: does the usage of `tf.stop_gradient` in Line 315 mean that the `hidden_policy` network that is used to output the policy is not affected by optimization?
Hi @Darthnash ,
You are correct. Sigma is independent of the state.
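As a hypothetical sketch of that shape (weights and dimensions here are made up for illustration): `mu` is produced per state by the network, while `log_sigma_sq` is a single trainable vector shared across all states.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))    # hypothetical state -> mu weights
log_sigma_sq = np.zeros(2)     # one value per action dim, no state input

def policy_params(state):
    mu = state @ W             # depends on the state
    return mu, log_sigma_sq    # sigma does not

mu1, s1 = policy_params(np.ones(4))
mu2, s2 = policy_params(np.zeros(4))
assert not np.allclose(mu1, mu2)  # mu varies with the state
assert np.array_equal(s1, s2)     # sigma is the same for every state
```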
We use `tf.stop_gradient` in places where certain values are needed to compute others, but we don't want their result to impact the gradients for those parts of the network. In the case of the selected output from the policy stream, we don't want it to change; we only want the `mu` and `sigma` values which produced it to be affected.
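The effect of detaching the sampled action can be shown with hand-computed gradients for a 1-D Gaussian log-probability (a numeric sketch under simplified assumptions, not the repo's TF graph):

```python
import numpy as np

def logp(a, mu, sigma_sq):
    # log N(a; mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma_sq) - (a - mu) ** 2 / (2 * sigma_sq)

mu, sigma_sq, eps = 0.5, 1.0, 0.3
a = mu + np.sqrt(sigma_sq) * eps  # sampled action, a = mu + sigma * eps

# With stop_gradient: a is treated as a constant, so
# d logp / d mu = (a - mu) / sigma_sq  -- a useful learning signal.
grad_detached = (a - mu) / sigma_sq

# Without stop_gradient the sample itself depends on mu (da/dmu = 1),
# and the two paths cancel: (a - mu)/sigma_sq * (1 - da/dmu) = 0.
grad_through_sample = (a - mu) / sigma_sq * (1 - 1.0)

assert abs(grad_detached - 0.3) < 1e-12
assert grad_through_sample == 0.0
```

So the gradient still reaches `mu` and `sigma` through the log-probability; it just does not flow back through the sampling step itself.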
Hi @awjuliani,
thank you for your great help so far.
Unfortunately I have another question about the implementation:
why is the value error computed twice in Lines 189 & 190 of the PPO optimizer, and what is the use of `tf.dynamic_partition`, given that `mask` is set to 1.0 for every step anyway?
@Darthnash,
`tf.dynamic_partition` is used to mask elements when using recurrent neural networks. When you aren't using an RNN, it doesn't mask anything.
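The masking above can be sketched in numpy (an illustrative equivalent of partitioning on a 0/1 mask, not the actual TF call):

```python
import numpy as np

# Elements whose mask is 0 (e.g. padding steps in an RNN batch) are dropped;
# elements whose mask is 1 are kept for the loss.
values = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 1])

kept = values[mask == 1]     # partition 1: real steps
dropped = values[mask == 0]  # partition 0: padded steps, ignored

assert kept.tolist() == [1.0, 3.0, 4.0]

# With a mask of all ones (the non-RNN case), every element lands in the
# "kept" partition, so the operation is effectively a no-op.
all_ones = values[np.ones_like(mask) == 1]
assert all_ones.tolist() == values.tolist()
```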
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi there! I'm currently trying to figure out the implementation of the `entropy` variable. So far, I found the recalculation of `self.entropy` in line 320 of the LearningModel class and the calculation of `sigma` a few lines above. However, I cannot see how the variable `log_sigma_sq` in Line 301 is modified during training. Is it correct that this entropy is the differential entropy that is used to measure how surprising the output of the policy distribution is?
Any help would be much appreciated!