Closed IbrahimSobh closed 7 years ago
For an example see: https://openreview.net/pdf?id=Hk3mPK5gg
To reduce the correlation of game experience, Asynchronous Advantage Actor-Critic Model [Mnih et al. (2016)] runs independent multiple threads of the game environment in parallel. These game instances are likely uncorrelated, therefore their experience in combination would be less biased. For on-policy models, the same mutual reinforcement behavior will also lead to highly-peaked π(a|s) towards a few actions (or a few fixed action sequences), since it is always easy for both actor and critic to over-optimize on a small portion of the environment, and end up “living in their own realities”. To reduce the problem, [Mnih et al. (2016)] added an entropy term to the loss to encourage diversity, which we find to be critical.
Intuitively, it adds more cost to actions that too quickly dominate, and the higher cost favors more exploration (on top of the random e-greediness).
Thank you very much fferreres
Excellent explanation. However:
1- First point:
You have mentioned that:
"it adds more cost to actions that too quickly dominate"
- When we have actions with almost equal probabilities, the added cost (entropy over actions) will be high.
- When we have some dominant action, with higher probability than other actions, the added cost (entropy over actions) will be low.
So the added cost will be large for near-random actions?! (not dominant ones), and small in the case of a dominant action?!
the corresponding code:
```python
self.entropy = -tf.reduce_sum(self.probs * tf.log(self.probs), 1, name="entropy")
...
self.losses = - (tf.log(self.picked_action_probs) * self.targets + 0.01 * self.entropy)
self.loss = tf.reduce_sum(self.losses, name="loss")
```
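As a sanity check of the point above (a quick numpy sketch, not repo code; the action probabilities are made up), entropy is indeed high for near-uniform action probabilities and low when one action dominates:

```python
import numpy as np

def entropy(probs):
    # H = -sum(p * log p), the same per-state quantity the A3C code computes.
    probs = np.asarray(probs, dtype=np.float64)
    return float(-np.sum(probs * np.log(probs)))

uniform = [0.25, 0.25, 0.25, 0.25]   # no dominant action
peaked  = [0.97, 0.01, 0.01, 0.01]   # one dominant action

print(entropy(uniform))  # ~1.386 (= ln 4), high
print(entropy(peaked))   # ~0.168, low
```

So the entropy bonus is largest exactly when the policy is closest to random, which is the source of the confusion in this thread.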
(Sorry, I am confused)
2- Second point:
As I understand, in case of A3C, we do not use e-greedy, only adding entropy for more exploration.
3- Last point:
Regarding reproducing the results of the paper: in https://github.com/dennybritz/reinforcement-learning/issues/30 I mentioned that
I think β was 0.01 in the code, as in the paper. However, is it possible that, for some reason, the agent converged to a suboptimal policy?
In other words, as mentioned in the paper you shared:
since it is always easy for both actor and critic to over-optimize on a small portion of the environment, and end up “living in their own realities”
Do you think this could be the reason that results peak at score ~35?
Thank you
Entropy = Different
Do you think this could be the reason that results peak at score ~35?
I don't. It helps, but even the entropy term only helps partially. I really think it's a sample-size thing. For example, look at page 530, at how the rewards plateau in the top-right figure. Each epoch is 520k frames (steps), and between epoch 60 and 140 the scores improve little or nothing on average. That's 80 epochs, or over 40 million steps, before getting much better.
https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf
Also, on page 12 of the PDF they show that regular DQN without replay and without a separate target network scores below 11 in Breakout (after training for much longer), and a linear approximator scores around 2. So getting to 35 after 2M steps seems normal; we still can't know much about the performance. We just need a computer with a fast GPU, a 1070 at least, for 2 days.
Thank you
I have limited access to an Amazon g2.2xlarge instance (https://aws.amazon.com/ec2/instance-types/). It finished 1 million frames (steps) in around 4 hours (A3C with 8 parallel agents). However, we need to train for over 40 million steps, which means around 160 hours! (Too much time and too much money.) :(
Regarding the entropy, I generally understand the idea of adding more cost. However, it seems reversed to me! (depending on when entropy is high or low, as I elaborated above)
I hope @dennybritz (as usual) can make it clearer for me :)
Thanks again
Yeah, but I think this A3C implementation isn't using the GPU (using one changes the algorithm a bit: you build a queue and update gradients centrally), so a g2.2xlarge is expensive and not very efficient. Better an instance with 8 CPUs and hyper-threading for 16+ workers (much cheaper).
Regarding the entropy, I went and checked what different A3C implementations do, and it's what you say; in fact, I think it doesn't match what is explained in any of the papers. In particular, entropy (i.e., the cost) is maximal when the probability of the actions is uniform, and tends to zero the further they are from uniform. So a gradient based on that will punish uniformity: it encourages changes in the parameters in the direction AWAY from uniform, and thus can't be doing what many papers say, including the A3C explanation:
We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies.
But that can't happen unless Google is doing something different from what they imply in the paper, or at least from how it's being interpreted.
The original paper everyone cites is this one - although it's missing figures for some reason: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54.3433&rep=rep1&type=pdf
In it the authors explain that ...
The use of entropy maximization is designed to help keep the search alive by preventing convergence to a single choice of output especially when several choices all lead to roughly the same reinforcement value.
Now, based on the paper, which I clearly can't make full sense of, it appears to say (my interpretation follows) that in certain states it may make most sense to stay close to a high-entropy (closer-to-uniform) probability, if for example those paths all lead to equally good outcomes down the road.
(My explanation) For example, suppose there's a Y junction in a path, with both branches leading to the same place. Very early convergence will stick to left or right and reinforce that branch, but the two branches are really equally good. In this way, early convergence discourages exploration, and since we don't know all states, we may well be missing something that made the other branch better. So it would be better for the algorithm to actually stay close to uniform there, to avoid local optima.
It also seems that the entropy equations, and what the paper actually states, are not the same as what is being done in the context of today's networks and setups (for many reasons in the paper). Most of the discussion is about REINFORCE, with the regularization being equivalent in a sense to bias regularization. The math does not appear to say to use it as in the current context.
But in the current A3C, I also fail to see how the current implementation actually helps at all, especially in the direction of what is being attempted (encouraging exploration). The current implementations obviously use entropy as in
H = - Sum over all actions [P(action) log(P(action))]
but its value will vary depending on the number of actions and the log base used (log2 with 2 actions has a maximum of 1). I am not sure how to know the entropy of a state in very large state spaces with function approximation. Maybe just penalize deviations from uniform? But it would be nice to actually penalize convergence at Y junctions, or in states whose actions lead to similar rewards, or where rewards are more uncertain, especially early on. Even the negative of what is currently done may be better than the opposite (just let the cost of the regularization be negative; as horrible as that may be, it would encourage exploration).
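The dependence on the number of actions and the log base can be checked with a small sketch (not repo code): the maximum entropy, reached at the uniform distribution, is log_base(n) for n actions.

```python
import math

def max_entropy(n_actions, base=math.e):
    # The uniform distribution maximizes entropy; the maximum is log_base(n_actions).
    p = 1.0 / n_actions
    return -n_actions * p * math.log(p, base)

print(max_entropy(2, base=2))  # 1.0, the "log2 of 2 actions" case mentioned above
print(max_entropy(4))          # ln 4 ~ 1.386
print(max_entropy(18))         # ln 18 ~ 2.890 (e.g. a full Atari action set)
```

This is why a fixed β like 0.01 trades off differently in environments with different action-set sizes.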
In partial observability, a great advantage of policy gradient methods is that they directly model action probabilities, which allows a trained agent to bluff in poker (the right proportion of times). But the current regularization doesn't seem to help avoid convergence in places where outcome uncertainty is intrinsically high (e.g., choices that are equally good or bad), so the current entropy regularization does not help; only the simulated-annealing-style greediness is helping, and if you are unlucky with it, you'll be unlucky for the entire exercise.
So, now I see I have the same question you have, and I hope we are not upsetting Denny by deviating from the code to start a conversation about algorithms in general :-) (Denny, let us know if this isn't proper.)
Dear fferreres
Your elaboration above is exactly what I meant
"In particular, entropy (i.e., the cost) is maximal when the probability of the actions is uniform, and tends to zero the further they are from uniform. So a gradient based on that will punish uniformity ... and thus can't be doing what many papers say ... But that can't happen unless Google is doing something different ...."
I am happy that we both have the same confusion about how exactly adding entropy to loss will encourage exploration
In the paper you mentioned:
The use of entropy maximization is designed to help keep the search alive by preventing convergence to a single choice of output
entropy maximization = encourage uniform action probability
I was thinking:
What will happen if we subtract entropy from loss (instead of adding it), to encourage "entropy maximization" not minimization, in order to encourage uniform distribution?
What will happen if we just used e-greedy approach? (select random action at probability e and select action using policy estimator at probability 1-e)
Regarding your question:
I am not sure how to know entropy of a state in very large state spaces with function approximation
I think, guided by the A3C code, that the entropy is calculated for a state by just applying the entropy equation to all actions of that state. More specifically, if we are in state s and using the policy estimator (a neural network as function approximator), then we know the probabilities of all actions in that state, and we simply calculate the entropy as follows:
H(X) = - Sum over all actions [P(action) log(P(action))]
This happens independently of the particular state, actions, or rewards. It just happens all the time.
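Concretely, with a softmax policy head, that per-state computation looks like the following sketch (the logits are hypothetical, just to illustrate early vs. late training):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return z / z.sum()

def policy_entropy(logits):
    # Entropy of pi(.|s) from the network's raw outputs for state s.
    probs = softmax(logits)
    return float(-np.sum(probs * np.log(probs)))

# Hypothetical logits for one state with 4 actions:
early_training = [0.1, 0.0, -0.1, 0.05]  # near-uniform policy, entropy near ln 4
late_training  = [5.0, 0.0, -1.0, 0.2]   # peaked policy, entropy near 0
print(policy_entropy(early_training))
print(policy_entropy(late_training))
```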
Finally, I would like to thank @dennybritz for his great work and educational blog, that enabled us to understand the basics and even implement and discuss details of advanced algorithms.
Yes, I know how it's being calculated, as I looked into the code. My point is that the context of the original paper is not one with function approximation. Also, there are these things that "confuse me" about the paper:
Look at equation 30, then equation 28, then look early in the paper the reference to J as a function built with maximization in mind.
It would be nice to pick a much simpler RL problem and compare average rewards for the current algorithm as-is, then with the sign inverted (which is what the actual paper says, since J is a maximization problem), and then the same again but using just H = SUM(-ln P(Ai)).
I wish I were more proficient in this, so I could just pick a simpler (but non-trivial) problem and do the test myself, to see if it makes a difference in any direction empirically. But it would also be nice if someone looked at the paper that every A3C paper references, to check the rule-of-thumb entropy term everyone is adding, which is likely not doing what they describe (and which you noted first).
And now we have the UNREAL extension to a3c. Denny if you find any sites explaining it more easily would be great to add to the list of algorithms
The UNREAL paper subtracts the entropy; see page 3:
https://arxiv.org/pdf/1611.05397.pdf
Just changing the sign should improve A3C here. I'm still confused about the original paper's proposed gradient update (just H = -ln(·)).
You have overlooked the minus sign before the whole equation, I think:

```python
self.losses = - (tf.log(self.picked_action_probs) * self.targets + 0.01 * self.entropy)
```

As you know, it is equivalent to:

```python
self.losses = -tf.log(self.picked_action_probs) * self.targets - 0.01 * self.entropy
```
In the original paper, Williams et al. were maximising expected reward. In the DeepMind implementation, Mnih et al. decided to use a minimisation algorithm, so they added a minus before the whole equation.
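A small numeric check of that sign convention (a sketch, not the repo's training loop): taking a gradient-descent step on just the `- 0.01 * entropy` part of the loss should push the policy's entropy up, not down.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(z):
    p = softmax(z)
    return float(-np.sum(p * np.log(p)))

def num_grad(f, z, eps=1e-6):
    # Finite-difference gradient; good enough for a sanity check.
    g = np.zeros_like(z)
    for i in range(len(z)):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (f(z + d) - f(z - d)) / (2 * eps)
    return g

beta = 0.01
z = np.array([2.0, 0.0, -1.0, 0.5])       # logits of a somewhat peaked policy
loss = lambda w: -beta * entropy(w)       # the "- 0.01 * entropy" part of the loss
z_new = z - 1.0 * num_grad(loss, z)       # one gradient-DESCENT step on the loss
print(entropy(z), "->", entropy(z_new))   # entropy increases
```

So minimizing the loss as written is the same as maximizing log π · A plus β·H, matching what the papers describe.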
@Astrych thanks for pointing it out. I had missed how it is already negative on first look. Wish you were here two weeks ago :-) Thanks!
Thank you so much:
@Astrych
@fferreres
Do you mean that:
I mentioned above that:
"What will happen if we subtract entropy from loss (instead of adding it), to encourage "entropy maximization" not minimization, in order to encourage uniform distribution?"
I want to close this issue after we all agree on something very clearly :)
Finally, I wonder what will happen if we just used e-greedy approach? (select random action at probability e and select action using policy estimator at probability 1-e)
Thank you
We want to minimize the loss
Yes
We actually subtract entropy from the loss (not add it?), which means we encourage high entropy?
Yes
I want to close this issue after we all agree on something very clearly :)
We all agree (so far)
Thank you!
I wonder what will happen if we neglect the entropy thing and just use e-greedy approach? (select random action at probability e and select action using policy estimator at probability 1-e)
I mean instead of adding entropy.
Hello @IbrahimSobh
Late to the party. Did you try e-greedy approach? Any improvements/results?
According to the Asynchronous Methods for Deep Reinforcement Learning paper, the entropy is added to the policy gradient objective, which means we want to maximize log(p) and maximize the entropy as well.
Hello. In the case of discrete action spaces, high entropy basically leads to more random actions. I would like to ask how one calculates the entropy in the case of continuous action spaces.
Also, doesn't this have a trade-off of encouraging less certain decisions?
Finally, what do you think is the relationship with adding noise to the output during training, as that would also encourage trying more random actions, I guess.
Hello @IbrahimSobh
Late to the party. Did you try e-greedy approach? Any improvements/results?
Late to the party too. Did anyone try the e-greedy approach? I don't understand why we need entropy instead of e-greedy for exploration.
The main issue is that e-greedy is an off-policy exploration strategy; the exploration is added on top of a policy. You have a certain behavioral policy, and with probability epsilon you act uniformly at random instead of choosing the action the policy prescribes.
A3C is an on-policy algorithm for the most part (workers getting out of sync makes the data a bit off-policy), and it won't deal very well with off-policy data. Notice how entropy promotes an exploration strategy that is part of the stochastic policy; it's baked into the A3C policy. Entropy just encourages the policy to stay stochastic; it is not changing the policy.
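A sketch of the difference (the 4-action policy probabilities are hypothetical): with e-greedy, exploration is bolted on from outside the policy, while sampling from a stochastic π(a|s) already explores on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.7, 0.1, 0.1, 0.1])  # hypothetical policy output pi(a|s)

def on_policy_action(probs):
    # A3C-style: exploration is baked into the stochastic policy itself.
    return rng.choice(len(probs), p=probs)

def e_greedy_action(probs, eps=0.1):
    # Off-policy-style: act greedily, but with probability eps ignore the policy.
    if rng.random() < eps:
        return int(rng.integers(len(probs)))
    return int(np.argmax(probs))
```

The first sampler generates exactly the data the on-policy update assumes; the second generates data from a different (greedy-plus-noise) distribution, which is what makes it off-policy.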
@mimoralea, how will entropy not affect the policy? We are using entropy in the loss function. So won't that have an effect on the overall policy?
@Astrych, here you mentioned the overall loss is negative:

```python
self.losses = - (tf.log(self.picked_action_probs) * self.targets + 0.01 * self.entropy)
```
Does this mean that the higher the entropy, the more negative the loss is? If it's more negative, then it will affect the gradient less, right?
I do not understand how adding entropy to loss will encourage exploration
I understand that Entropy is a measure of unpredictability, or measure of randomness.
H(X) = -Sum P(x) log(P(x))
While training, we want to reduce the loss. By adding the entropy (over possible actions) to the loss, we will reduce the entropy too (reduce unpredictability).
When all actions have almost the same probability, the entropy is high. When one action has probability near 1, the entropy is low.
At the beginning of training, almost all actions have the same probability. After some training, some actions get higher probability (in the direction of getting more rewards), and the entropy is reduced over time.
However, I am confused, how adding entropy to loss will encourage exploration?