Closed Darthnash closed 6 years ago
Hi @Darthnash,
The entropy in continuous control models is a function of the `log_sigma_sq` variable. It corresponds to the spread of the Gaussian distribution: larger entropy means larger spread, and as such more variation in sampled action values. During training we use an entropy regularizer in the loss to encourage policies with greater spread in their distribution, which corresponds to encouraging the agent to explore the action space. I hope that answers your question.
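The relationship between `log_sigma_sq`, spread, and entropy can be sketched numerically. Here is a minimal numpy illustration (not the repo's actual TensorFlow code; `beta` in the comment is a hypothetical hyperparameter name for the regularizer weight):

```python
import numpy as np

def gaussian_entropy(log_sigma_sq):
    """Differential entropy of a diagonal Gaussian:
    H = 0.5 * (log(2*pi*e) + log(sigma^2)), summed over action dims."""
    return np.sum(0.5 * (np.log(2.0 * np.pi * np.e) + log_sigma_sq))

# Larger log_sigma_sq -> wider distribution -> higher entropy.
narrow = gaussian_entropy(np.array([np.log(0.1)]))
wide = gaussian_entropy(np.array([np.log(2.0)]))
assert wide > narrow

# During training the entropy typically enters the loss as a bonus term,
# e.g. loss = policy_loss - beta * entropy, so minimizing the loss pushes
# the entropy (i.e. the spread) up, encouraging exploration.
```

Since the entropy is a differentiable function of `log_sigma_sq`, the regularizer's gradient flows directly into that variable during the update.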
Hi @awjuliani,
thanks a lot for your answer!
I think I got it now. So far I thought that the entropy was calculated after the policy was already found, in order to rate the new policy. But instead `sigma_sq` is a parameter that is modified during optimization to find the new policy, in a way that changes which actions are chosen. Is that more or less correct?
Just one more question: is there a reason that `log_sigma_sq` is defined as a variable rather than `sigma_sq` directly?
Hi @Darthnash,
`sigma_sq` is used both to select actions and to measure entropy. During the update phase it is adjusted both to increase the reward and to increase the entropy. The log is typically used because it has nicer mathematical properties for optimization and for representation by a neural network.
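A small numpy sketch of why the log parameterization is convenient (illustrative only; the variable names mirror the discussion, not the repo's code):

```python
import numpy as np

# Parameterizing the variance via its log keeps sigma^2 strictly positive
# for ANY real value the optimizer produces, so no constraint handling
# (clipping, projection) is needed during gradient descent.
log_sigma_sq = np.array([-3.0, 0.0, 4.0])  # unconstrained parameter values
sigma_sq = np.exp(log_sigma_sq)            # always > 0
assert np.all(sigma_sq > 0)

# A unit step in log space is a multiplicative change in sigma^2, which
# behaves better across scales than additive updates to sigma_sq itself.
assert np.isclose(np.exp(0.0 + 1.0) / np.exp(0.0), np.e)
```

If `sigma_sq` were the raw variable, a gradient step could push it to zero or below, making the Gaussian density undefined.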
Thank you @awjuliani for your quick response.
Is the updated sigma then independent of the state that the agent is actually in?
I'm sorry that I keep bothering you, but there is another thing that I can't wrap my head around: does the usage of `tf.stop_gradient` in Line 315 mean that the `hidden_policy` network that is used to output the policy is not affected by optimization?
Hi @Darthnash ,
You are correct. Sigma is independent of the state.
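As a hypothetical sketch of that shape (weights and dimensions here are made up for illustration): `mu` is produced per state by the network, while `log_sigma_sq` is a single trainable vector shared across all states.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))    # hypothetical state -> mu weights
log_sigma_sq = np.zeros(2)     # one value per action dim, no state input

def policy_params(state):
    mu = state @ W             # depends on the state
    return mu, log_sigma_sq    # sigma does not

mu1, s1 = policy_params(np.ones(4))
mu2, s2 = policy_params(np.zeros(4))
assert not np.allclose(mu1, mu2)  # mu varies with the state
assert np.array_equal(s1, s2)     # sigma is the same for every state
```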
We use `tf.stop_gradient` in places where certain values are needed to compute others, but we don't want their result to impact the gradients for those parts of the network. In the case of the selected output from the policy stream, we don't want it to change; we only want the `mu` and `sigma` values which produced it to be affected.
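The effect of detaching the sampled action can be shown with hand-computed gradients for a 1-D Gaussian log-probability (a numeric sketch under simplified assumptions, not the repo's TF graph):

```python
import numpy as np

def logp(a, mu, sigma_sq):
    # log N(a; mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma_sq) - (a - mu) ** 2 / (2 * sigma_sq)

mu, sigma_sq, eps = 0.5, 1.0, 0.3
a = mu + np.sqrt(sigma_sq) * eps  # sampled action, a = mu + sigma * eps

# With stop_gradient: a is treated as a constant, so
# d logp / d mu = (a - mu) / sigma_sq  -- a useful learning signal.
grad_detached = (a - mu) / sigma_sq

# Without stop_gradient the sample itself depends on mu (da/dmu = 1),
# and the two paths cancel: (a - mu)/sigma_sq * (1 - da/dmu) = 0.
grad_through_sample = (a - mu) / sigma_sq * (1 - 1.0)

assert abs(grad_detached - 0.3) < 1e-12
assert grad_through_sample == 0.0
```

So the gradient still reaches `mu` and `sigma` through the log-probability; it just does not flow back through the sampling step itself.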
Hi @awjuliani,
thank you for your great help so far.
Unfortunately I have another question about the implementation:
why is the value error computed twice in Lines 189 & 190 of the PPO optimizer, and what is the use of `tf.dynamic_partition`, given that `mask` is set to 1.0 for every step anyway?
@Darthnash,
`tf.dynamic_partition` is used to mask elements when using recurrent neural networks. When you aren't using an RNN, it doesn't mask anything.
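The masking above can be sketched in numpy (an illustrative equivalent of partitioning on a 0/1 mask, not the actual TF call):

```python
import numpy as np

# Elements whose mask is 0 (e.g. padding steps in an RNN batch) are dropped;
# elements whose mask is 1 are kept for the loss.
values = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 1])

kept = values[mask == 1]     # partition 1: real steps
dropped = values[mask == 0]  # partition 0: padded steps, ignored

assert kept.tolist() == [1.0, 3.0, 4.0]

# With a mask of all ones (the non-RNN case), every element lands in the
# "kept" partition, so the operation is effectively a no-op.
all_ones = values[np.ones_like(mask) == 1]
assert all_ones.tolist() == values.tolist()
```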
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi there! I'm currently trying to figure out the implementation of the `entropy` variable. So far, I found the recalculation of `self.entropy` in line 320 of the LearningModel class and the calculation of `sigma` a few lines above. However, I cannot see how the variable `log_sigma_sq` in Line 301 is modified during training. Is it correct that this entropy is the differential entropy that is used to measure how surprising the output of the policy distribution is?
Any help would be much appreciated!