dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

A3C: We add entropy to the loss to encourage exploration #34

Closed IbrahimSobh closed 7 years ago

IbrahimSobh commented 7 years ago

I do not understand how adding entropy to the loss will encourage exploration.

I understand that Entropy is a measure of unpredictability, or measure of randomness.

H(X) = -Sum P(x) log(P(x))

While training, we want to reduce the loss. By adding the entropy (of the possible actions) to the loss, we will reduce the entropy too (reduce unpredictability).

When all actions have almost the same probability, the entropy will be high. When one action has probability near 1, the entropy will be low.

At the beginning of training, almost all actions have the same probability. After some training, some actions get higher probability (in the direction of getting more rewards), and the entropy is reduced over time.
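For concreteness, a small sketch of the two cases (mine, plain NumPy, not code from the repo):

import numpy as np

def entropy(probs):
    # H(X) = -sum_x P(x) * log(P(x)); zero-probability actions contribute nothing
    probs = np.asarray(probs, dtype=np.float64)
    nonzero = probs > 0
    return -np.sum(probs[nonzero] * np.log(probs[nonzero]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 (= log 4): early in training, near uniform
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.168: one action dominates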

However, I am confused: how does adding entropy to the loss encourage exploration?

fferreres commented 7 years ago

For an example see: https://openreview.net/pdf?id=Hk3mPK5gg

To reduce the correlation of game experience, Asynchronous Advantage Actor-Critic Model [Mnih et al. (2016)] runs independent multiple threads of the game environment in parallel. These game instances are likely uncorrelated, therefore their experience in combination would be less biased. For on-policy models, the same mutual reinforcement behavior will also lead to highly-peaked π(a|s) towards a few actions (or a few fixed action sequences), since it is always easy for both actor and critic to over-optimize on a small portion of the environment, and end up “living in their own realities”. To reduce the problem, [Mnih et al. (2016)] added an entropy term to the loss to encourage diversity, which we find to be critical.

Intuitively, it adds more cost to actions that too quickly dominate, and the higher cost favors more exploration (on top of the random e-greediness).

IbrahimSobh commented 7 years ago

Thank you very much fferreres

Excellent explanation. However:

1- First point:

You have mentioned that:

"it adds more cost to actions that too quickly dominate"

  • When we have actions with almost equal probabilities, the added cost (entropy over actions) will be high.
  • When we have some dominant action, with higher probability than other actions, the added cost (entropy over actions) will be low.

So the added cost will be large when actions are random (not dominant), and the added cost will be small in the case of a dominant action?!

The corresponding code:

# entropy of the action distribution for each state, H = -sum_a p(a) log p(a)
self.entropy = -tf.reduce_sum(self.probs * tf.log(self.probs), 1, name="entropy")
...
# policy-gradient loss with the entropy term weighted by 0.01
self.losses = - (tf.log(self.picked_action_probs) * self.targets + 0.01 * self.entropy)
self.loss = tf.reduce_sum(self.losses, name="loss")

(Sorry, I am confused)

2- Second point:

As I understand it, in the case of A3C we do not use e-greedy; we only add entropy for more exploration.

3- Last point:

Regarding reproducing the results of the paper (https://github.com/dennybritz/reinforcement-learning/issues/30), I've mentioned that:

I think β was 0.01 in the code, as in the paper. However, is it possible that, for some reason, the agent converged to a suboptimal policy?

In other words, as mentioned in the paper you shared:

since it is always easy for both actor and critic to over-optimize on a small portion of the environment, and end up “living in their own realities”

Do you think this could be the reason that results peak at score ~35?

Thank you

fferreres commented 7 years ago

Entropy = Different

Do you think this could be the reason that results peak at score ~35?

I don't. It helps, but even cross-entropy only helps partially. I really think it's a sample-size thing. For example, look at page 530 and how the rewards plateau in the top-right figure. Each epoch is 520k frames (steps), and between epoch 60 and epoch 140 the scores improve little or nothing on average. That's 80 epochs, or over 40 million steps, before getting much better.

https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf

Also, on page 12 of the PDF they show that regular DQN without replay and a separate target network scores below 11 on Breakout (after training for much longer), and the linear model scores around 2. So getting to 35 after 2M steps seems normal; we still can't know much about the performance. We just need a computer with a fast GPU like a 1070, for at least 2 days.

IbrahimSobh commented 7 years ago

Thank you

I have limited access to an Amazon g2.2xlarge instance (https://aws.amazon.com/ec2/instance-types/). It finished 1 million frames (steps) in around 4 hours (A3C with 8 parallel agents). However, we need to train for over 40 million steps, which means around 160 hours! (Too much time and too much money.) :(

Regarding the entropy, I generally understand the idea of adding more cost. However, it seems reversed to me! (when entropy is high or low, as I elaborated above)

I hope @dennybritz (as usual) can make it clearer for me :)

Thanks again

fferreres commented 7 years ago

Yeah, but I think A3C (this implementation) isn't using the GPU (using it would change the algorithm a bit, to build a queue and update gradients centrally), so a g2.2xlarge is expensive and not very efficient. Better an instance with 8 CPUs and hyper-threading for 16+ workers (much cheaper).

Regarding the entropy, I went and checked what different A3C implementations do, and it's what you say; in fact, I think it likewise doesn't match what is explained in any of the papers. In particular, entropy (i.e. the cost) is maximal when the probability of the actions is uniform, and tends to zero the further they are from uniform. So a gradient based on that will punish uniform. It encourages changes in the parameters in the direction AWAY from uniform, and thus can't be doing what many papers say, including the A3C explanation:

We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies.

But that can't happen unless Google is doing something different from what they imply in the paper, or at least from how it's interpreted.

The original paper everyone cites is this one - although it's missing figures for some reason: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54.3433&rep=rep1&type=pdf

In it the authors explain that ...

The use of entropy maximization is designed to help keep the search alive by preventing convergence to a single choice of output especially when several choices all lead to roughly the same reinforcement value.

Now, based on the paper, which I clearly can't make full sense of, it appears to say (my interpretation follows) that in certain states it may make the most sense to stay close to a high-entropy distribution (closer to uniform), if, for example, those paths all lead to equally good outcomes down the road.

(My explanation) For example, suppose there's a Y junction on a path, with both branches leading to the same place. Very early convergence will stick to left or right and reinforce that path, but the two paths are really equally good. In this way, early convergence discourages exploration, and since we don't know all the states, we may well be missing something that made the other path better. So it'd be better for the algorithm to actually stay close to uniform there, to avoid local optima.

It also seems that the entropy equations, and what the paper actually states, are not the same as what is being done in the context of current networks and setups (for many reasons evident from the paper). Most of the discussion there is about REINFORCE, with the regularization being equivalent, in a sense, to bias regularization. The math does not appear to say to use it as in the current context.

But in the current A3C, I also fail to see how the implementation actually helps at all, especially in the direction of what is being attempted (encouraging exploration). The current implementations obviously use entropy as in:

H(π(·|s)) = - Sum over actions [π(a|s) log(π(a|s))]

... but this will vary depending on the number of actions and the log base used (log2 with 2 actions has a maximum of 1). I am not sure how to know the entropy of a state in very large state spaces with function approximation. Maybe just penalize deviations from uniform? But it would be nice to actually penalize convergence at Y-junctions, or at states whose actions lead to similar rewards, or where rewards are more uncertain, especially early on. Even something like the negative of what is currently done may be better than the opposite (just let the cost of the regularization be negative; as horrible as that may be, it'd encourage exploration).
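To make the dependence on the number of actions and the log base concrete, a tiny sketch (mine, not repo code):

import numpy as np

n_actions = 2
uniform = np.full(n_actions, 1.0 / n_actions)

# maximum entropy is reached at the uniform distribution and equals log(n_actions),
# so the value depends on the log base and on how many actions there are
print(-np.sum(uniform * np.log2(uniform)))  # 1.0   (log2, 2 actions)
print(-np.sum(uniform * np.log(uniform)))   # ~0.693 (natural log, 2 actions)
print(np.log(4))                            # ~1.386 = max natural-log entropy for 4 actions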

Under partial observability, a great advantage of policy gradient methods is that they directly model action probabilities, which allows a trained agent to bluff in poker (the right proportion of the time). But given the current regularization, it doesn't seem to be helping avoid convergence in places where outcome uncertainty is intrinsically high (e.g. choices that are equally good or bad), and the current entropy regularization does not help; only the simulated-annealing-style greediness is helping it, and if you are unlucky with that, you'll be unlucky for the entire exercise.

So now I see I have the same question you have, and I hope we are not upsetting Denny by deviating from the code to start a conversation about algorithms in general :-) (Denny, let us know if this isn't proper.)

IbrahimSobh commented 7 years ago

Dear fferreres

Your elaboration above is exactly what I meant

"In particular, entropy (eg. Cost) is maximal when probability of the actions is uniform, and tends to zero when further they are from uniform. So a gradient based on that will punish uniform. It encourages changes in params in the direction AWAY from uniform, and can't thus be saying as many papers say ... But that can't happen unless Google doing something different ...."

I am happy that we both have the same confusion about how exactly adding entropy to the loss will encourage exploration.

In the paper you mentioned:

The use of entropy maximization is designed to help keep the search alive by preventing convergence to a single choice of output

entropy maximization = encourage uniform action probability

I was thinking: what will happen if we subtract entropy from the loss (instead of adding it), to encourage "entropy maximization" rather than minimization, in order to encourage a uniform distribution?

Regarding your question:

I am not sure how to know the entropy of a state in very large state spaces with function approximation

I think, guided by the A3C code, that the entropy is calculated for a state by simply applying the entropy equation to all actions of that state. More specifically, if we are in state s and using the policy estimator (a neural network as the function approximator), then we know the probability of every action in that state. Then we simply calculate the entropy as follows:

H(X) = - Sum over all actions [P(action) log(P(action))]

This happens independently of the particular state, actions, or rewards; it is computed all the time.
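As a sketch of that calculation (mine, in NumPy rather than the repo's TensorFlow graph; the probabilities are hypothetical softmax outputs of the policy network for one state):

import numpy as np

def state_entropy(policy_probs, eps=1e-8):
    # entropy of the action distribution pi(.|s) for a single state s:
    # H(s) = -sum_a pi(a|s) * log(pi(a|s)); eps guards against log(0)
    p = np.asarray(policy_probs, dtype=np.float64)
    return -np.sum(p * np.log(p + eps))

pi_s = np.array([0.1, 0.6, 0.2, 0.1])  # assumed policy output for some state s
print(state_entropy(pi_s))  # ~1.09 nats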

Finally, I would like to thank @dennybritz for his great work and educational blog, that enabled us to understand the basics and even implement and discuss details of advanced algorithms.

fferreres commented 7 years ago

Yes, I know how it's being calculated, as I looked at the code. My point is that the context of the original paper is not one with function approximation. Also, there are these things that "confuse me" about the paper:

  1. It defines J as a function to MAXIMIZE early in the paper, so the H() term should be CHANGED to negative if we are minimizing. But everyone takes the J proposed in equation (30) of the paper verbatim and then MINIMIZES, which is clearly wrong, since the paper states that throughout the paper J is to be maximized. This explains why most just add the term as in equation (30) when they should include -H().
  2. The paper, in that proposal for J (with the caveat that they mention having no hidden units), refers to equation 27. While the full entropy is the expectation -Sum p log(p), for the J function to be updated stochastically they propose H(p) = -ln(p). You can also use 28 or 29, which just use p, with the same H proposed. They use the expectation early in the paper to show how the update (in the context they define) is an unbiased estimator of the entropy.

Look at equation 30, then equation 28, then look early in the paper at the reference to J as a function built with maximization in mind.

It would be nice to pick a much simpler RL problem and compare average rewards for the current algorithm as-is, then with the sign inverted (which is what the actual paper says, since J is a maximization problem), and then the same again but using just H = SUM(-ln P(a_i)).
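Something like this sketch could set up the comparison (hypothetical, in the TF 1.x style of the repo; the placeholder names are my own, standing in for the estimator's tensors):

import tensorflow as tf

probs = tf.placeholder(tf.float32, [None, 4])             # pi(a|s) for all actions
picked_action_probs = tf.placeholder(tf.float32, [None])  # pi(a_t|s_t)
targets = tf.placeholder(tf.float32, [None])              # advantage estimates

entropy = -tf.reduce_sum(probs * tf.log(probs), 1)        # H(pi(.|s)) per state
pg_loss = -tf.log(picked_action_probs) * targets          # policy-gradient part

loss_as_is   = tf.reduce_sum(pg_loss - 0.01 * entropy)    # repo's current form (entropy bonus)
loss_flipped = tf.reduce_sum(pg_loss + 0.01 * entropy)    # sign inverted (entropy penalty)
loss_lnp     = tf.reduce_sum(pg_loss + 0.01 * tf.log(picked_action_probs))  # per-sample H = -ln p as the bonus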

I wish I were more proficient in this, so I could just pick a simpler (but non-trivial) problem, run the test myself, and see whether it makes a difference in either direction empirically. But it'd also be nice if someone looked at the paper that every A3C paper references, to actually check the rule-of-thumb entropy term everyone is adding, which is likely not doing what they describe (and which you noted first).

fferreres commented 7 years ago

And now we have the UNREAL extension to A3C. Denny, if you find any sites explaining it more simply, it would be great to add it to the list of algorithms.

fferreres commented 7 years ago

The UNREAL paper subtracts the entropy; see page 3:

https://arxiv.org/pdf/1611.05397.pdf

Just changing the sign should improve A3C here. Still confused about the original paper's proposed gradient update (just H = -ln( . )).

aistrych commented 7 years ago

You have overlooked the minus sign before the whole expression, I think:

self.losses = - (tf.log(self.picked_action_probs) * self.targets + 0.01 * self.entropy)

As you know, it is equivalent to:

self.losses = -tf.log(self.picked_action_probs) * self.targets - 0.01 * self.entropy

In the original paper, Williams et al. were maximising the expected reward. In the DeepMind implementation, Mnih et al. decided to use a minimisation algorithm, so they added a minus sign in front of the whole expression.
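In other words (a small numerical sketch of my own, not repo code), minimizing that loss is the same as maximizing log(pi) * advantage + beta * H, so a higher-entropy policy really does get a lower loss:

import numpy as np

def loss(picked_prob, probs, target, beta=0.01):
    # mirrors the repo's expression: -(log(pi(a_t|s_t)) * target + beta * H(pi(.|s_t)))
    entropy = -np.sum(probs * np.log(probs))
    return -(np.log(picked_prob) * target + beta * entropy)

peaked  = np.array([0.97, 0.01, 0.01, 0.01])
uniform = np.array([0.25, 0.25, 0.25, 0.25])

# with the picked-action probability and advantage held fixed, the higher-entropy
# policy gets the smaller loss, i.e. entropy is rewarded, not punished
print(loss(0.25, peaked,  target=1.0))   # ~1.385
print(loss(0.25, uniform, target=1.0))   # ~1.372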

fferreres commented 7 years ago

@Astrych thanks for pointing it out. I had missed, on first look, that it is already negative. Wish you were here two weeks ago :-) Thanks!

IbrahimSobh commented 7 years ago

Thank you so much:
@Astrych @fferreres

Do you mean that:

  • We want to minimize the loss.
  • We actually subtract the entropy from the loss (not add it?), which means we encourage high entropy?

I mentioned above that:

"What will happen if we subtract entropy from the loss (instead of adding it), to encourage "entropy maximization" rather than minimization, in order to encourage a uniform distribution?"

I want to close this issue after we all agree on something very clearly :)

Finally, I wonder what would happen if we just used an e-greedy approach (select a random action with probability e, and select the action using the policy estimator with probability 1-e)?

Thank you

fferreres commented 7 years ago

We want to minimize the loss

Yes

We actually subtract the entropy from the loss (not add it?), which means we encourage high entropy?

Yes

I want to close this issue after we all agree on something very clearly :)

We all agree (so far)

IbrahimSobh commented 7 years ago

Thank you!

I wonder what would happen if we dropped the entropy term and just used an e-greedy approach (select a random action with probability e, and select the action using the policy estimator with probability 1-e)?

I mean instead of adding entropy.

deependersingla commented 7 years ago

Hello @IbrahimSobh

Late to the party. Did you try the e-greedy approach? Any improvements/results?

the0demiurge commented 6 years ago

[image: the policy-gradient update from the paper, ∇_θ' log π(a_t|s_t; θ')(R_t − V(s_t; θ_v)) + β ∇_θ' H(π(s_t; θ'))]

According to Asynchronous Methods for Deep Reinforcement Learning, the entropy is added to the policy-gradient objective, which means we want to maximize log(p) and maximize the entropy as well.

akaniklaus commented 5 years ago

Hello. For discrete action spaces, high entropy basically leads to more random actions. I would like to ask how one calculates the entropy in the case of continuous action spaces.

Also, doesn't this have a trade-off of encouraging less certain decisions?

Finally, what do you think is the relationship with adding noise to the output during training, as that would also encourage trying more random actions, I guess?

shtse8 commented 4 years ago

Hello @IbrahimSobh

Late to the party. Did you try the e-greedy approach? Any improvements/results?

Late to the party too. Did anyone try the e-greedy approach? I don't understand why we need entropy instead of e-greedy for exploration.

mimoralea commented 4 years ago

The main issue is that e-greedy is an off-policy exploration strategy; the exploration is added on top of a policy. You have a certain behavior policy, and with probability epsilon you act uniformly at random instead of choosing the action the policy prescribes.

A3C is an on-policy algorithm for the most part (workers getting out of sync makes the data a bit off-policy), and it won't deal very well with off-policy data. Notice how entropy promotes an exploration strategy that is part of the stochastic policy; it's baked into the A3C policy. Entropy just encourages the policy to stay stochastic, but it is not changing the policy.
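To illustrate the contrast (a sketch of mine, not repo code; the probabilities are a hypothetical policy output for one state):

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.6, 0.2, 0.1])  # hypothetical pi(.|s) for one state

def egreedy_action(pi, eps=0.1):
    # off-policy exploration: with probability eps act uniformly at random,
    # otherwise take the action the policy prescribes (its argmax)
    if rng.random() < eps:
        return int(rng.integers(len(pi)))
    return int(np.argmax(pi))

def on_policy_action(pi):
    # on-policy exploration, as in A3C: sample from the stochastic policy itself;
    # the entropy term only keeps this distribution from collapsing too early
    return int(rng.choice(len(pi), p=pi))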

Flock1 commented 2 years ago

@mimoralea, how will entropy not affect the policy? We are using entropy in the loss function. So won't that have an effect on the overall policy?

@Astrych, here you mentioned the overall loss is negative.

self.losses = - (tf.log(self.picked_action_probs) * self.targets + 0.01 * self.entropy)

Does this mean that the higher the entropy, the more negative the loss is? If it's more negative, then it will affect the gradient less, right?