dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

Gradient clipping in A3C #54

Open poweic opened 7 years ago

poweic commented 7 years ago

Hi there,

I noticed that even though the policy net and value net share some parameters (in a3c/estimators.py), their gradients are clipped separately (in a3c/worker.py).

I was wondering whether that could be a problem (clip before add vs. clip after add).

Suppose we clip gradients by norm at a threshold of 5.

local_grads = tf.clip_by_norm(local_grads, 5.0)

(to keep it simple, I use clip_by_norm here instead of clip_by_global_norm)

If, for some shared parameter, the gradient from the policy net is +10 and the gradient from the value net is -7, the combined gradient should be +10 - 7 = +3 (no clipping needed). But if we clip before summing, it becomes +5 - 5 = 0.
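A minimal sketch of this in plain Python (scalar gradients with the hypothetical +10/-7 numbers and the threshold of 5 from above):

def clip_by_norm(g, clip_norm=5.0):
    # same idea as tf.clip_by_norm applied to a single scalar:
    # rescale g if its magnitude exceeds clip_norm
    return g if abs(g) <= clip_norm else g * clip_norm / abs(g)

g_policy, g_value = 10.0, -7.0

# clip after summing: the combined gradient +3 is within the threshold
print(clip_by_norm(g_policy + g_value))                  # 3.0

# clip each head separately, then sum: +5 + (-5), the gradient vanishes
print(clip_by_norm(g_policy) + clip_by_norm(g_value))    # 0.0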

Thanks

dennybritz commented 7 years ago

Hmm, that's a good point. I am not sure what exactly is happening under the hood in Tensorflow here, but I would imagine that most of the time the gradients are well within the boundary and this shouldn't have much of an adverse effect. But I think you're right, this may not be 100% correct.

It seems a bit ugly to fix this. I guess you would need to combine the two train ops by iterating through all the gradients, adding them up, and then clipping the shared ones only after they have been summed?
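One way this could look (a rough sketch, assuming TF1-style lists of (gradient, variable) pairs as returned by compute_gradients on the policy and value losses; the helper name merge_and_clip is made up for illustration):

import tensorflow as tf

def merge_and_clip(policy_grads_and_vars, value_grads_and_vars, clip_norm=5.0):
    # sum gradients per variable first, so a parameter shared by both heads
    # contributes one combined gradient instead of two separately clipped ones
    summed = {}
    for grad, var in policy_grads_and_vars + value_grads_and_vars:
        if grad is None:
            continue
        summed[var] = grad if var not in summed else summed[var] + grad
    # clip only after the per-variable sums have been formed
    variables = list(summed.keys())
    clipped, _ = tf.clip_by_global_norm([summed[v] for v in variables], clip_norm)
    return list(zip(clipped, variables))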

poweic commented 7 years ago

@dennybritz I totally agree. It could be ugly, but maybe it's a chance to refactor. Sorry to get back to you so late; I was busy implementing ACER, which is something like an off-policy version of A3C.

I think one way to do this is to add up the losses from the policy net and value net first, then compute the gradients and clip them. I guess that would require lots of changes across the architecture because PolicyEstimator and ValueEstimator are currently separate classes.

My suggestion is that we merge PolicyEstimator and ValueEstimator into a single class, something like this:

def build_shared_network(input):
  # layers shared by both the policy and value heads
  ...
  return shared

def policy_network(shared):
  # policy head on top of the shared layers
  ...
  return mu, sigma

def value_network(shared):
  # value head on top of the shared layers
  ...
  return logits

class Estimator():
  def __init__(self, ...):
    ...
    shared = build_shared_network(...)
    mu, sigma = policy_network(shared)
    logits = value_network(shared)

    self.pi_loss = ...
    self.vf_loss = ...
    # one combined loss, so gradients of shared parameters are summed
    # before any clipping is applied
    self.loss = self.pi_loss + self.vf_loss - entropy

    if trainable:
      self.optimizer = ...
      self.grads_and_vars = self.optimizer.compute_gradients(self.loss)

This has several advantages: the gradients of the shared parameters come from a single combined loss and are clipped only once (which fixes the issue above), and the policy and value heads no longer have to be kept in sync across two separate classes.

But this is a big change and I'm not sure whether that's a good idea.
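For what it's worth, a worker could then compute and clip the gradients once from the combined loss, roughly like this (a sketch that assumes the hypothetical merged Estimator above and a clip threshold of 5.0, not the repository's current API):

import tensorflow as tf

def make_train_op(estimator, clip_norm=5.0):
    # estimator: an instance of the merged Estimator sketched above (hypothetical)
    grads, variables = zip(*estimator.grads_and_vars)
    # a single clipping step over the gradients of the combined loss
    clipped, _ = tf.clip_by_global_norm(list(grads), clip_norm)
    return estimator.optimizer.apply_gradients(list(zip(clipped, variables)))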