dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

Gradient clipping in A3C #54

Open poweic opened 7 years ago

poweic commented 7 years ago

Hi there,

I noticed that even though the policy net and value net share some parameters (in a3c/estimators.py), their gradients are clipped separately (in a3c/worker.py).

I was wondering whether that could be a problem (clip before add vs. clip after add).

Suppose we clip gradients by norm at a threshold of 5.

local_grads = tf.clip_by_norm(local_grads, 5.0)

(to keep it simple, I use clip_by_norm here instead of clip_by_global_norm)

If, for some shared parameter, the gradient from the policy net is +10 and the gradient from the value net is -7, the combined gradient should be +10 - 7 = +3 (no clipping needed). But if we clip before summing, it becomes +5 - 5 = 0.
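A minimal sketch of this in plain Python (scalar gradients with the hypothetical +10/-7 numbers and the threshold of 5 from above):

def clip_by_norm(g, clip_norm=5.0):
    # same idea as tf.clip_by_norm applied to a single scalar:
    # rescale g if its magnitude exceeds clip_norm
    return g if abs(g) <= clip_norm else g * clip_norm / abs(g)

g_policy, g_value = 10.0, -7.0

# clip after summing: the combined gradient +3 is within the threshold
print(clip_by_norm(g_policy + g_value))                  # 3.0

# clip each head separately, then sum: +5 + (-5), the gradient vanishes
print(clip_by_norm(g_policy) + clip_by_norm(g_value))    # 0.0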

Thanks

dennybritz commented 7 years ago

Hmm, that's a good point. I am not sure what exactly is happening under the hood in Tensorflow here, but I would imagine that most of the time the gradients are well within the boundary and this shouldn't have much of an adverse effect. But I think you're right, this may not be 100% correct.

It seems a bit ugly to fix this. I guess you would need to combine the two train ops by iterating through all the gradients, adding them up, and then clipping the shared ones only after they have been summed?
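One way this could look (a rough sketch, assuming TF1-style lists of (gradient, variable) pairs as returned by compute_gradients on the policy and value losses; the helper name merge_and_clip is made up for illustration):

import tensorflow as tf

def merge_and_clip(policy_grads_and_vars, value_grads_and_vars, clip_norm=5.0):
    # sum gradients per variable first, so a parameter shared by both heads
    # contributes one combined gradient instead of two separately clipped ones
    summed = {}
    for grad, var in policy_grads_and_vars + value_grads_and_vars:
        if grad is None:
            continue
        summed[var] = grad if var not in summed else summed[var] + grad
    # clip only after the per-variable sums have been formed
    variables = list(summed.keys())
    clipped, _ = tf.clip_by_global_norm([summed[v] for v in variables], clip_norm)
    return list(zip(clipped, variables))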

poweic commented 7 years ago

@dennybritz I totally agree. It could be ugly, but maybe it's a chance to refactor. Sorry to get back to you so late; I was busy implementing ACER, which is something like an off-policy version of A3C.

I think one way to do this is to add up the losses from the policy net and value net first, then compute the gradients and clip them. I guess that would require lots of changes across the architecture because PolicyEstimator and ValueEstimator are currently separate classes.

My suggestion is that we merge PolicyEstimator and ValueEstimator into a single class, something like this:

def build_shared_network(input):
  # layers shared by both the policy and value heads
  ...
  return shared

def policy_network(shared):
  # policy head on top of the shared layers
  ...
  return mu, sigma

def value_network(shared):
  # value head on top of the shared layers
  ...
  return logits

class Estimator():
  def __init__(self, ...):
    ...
    shared = build_shared_network(...)
    mu, sigma = policy_network(shared)
    logits = value_network(shared)

    self.pi_loss = ...
    self.vf_loss = ...
    # one combined loss, so gradients of shared parameters are summed
    # before any clipping is applied
    self.loss = self.pi_loss + self.vf_loss - entropy

    if trainable:
      self.optimizer = ...
      self.grads_and_vars = self.optimizer.compute_gradients(self.loss)

This has several advantages: the gradients of the shared parameters come from a single combined loss and are clipped only once (which fixes the issue above), and the policy and value heads no longer have to be kept in sync across two separate classes.

But this is a big change and I'm not sure whether that's a good idea.
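For what it's worth, a worker could then compute and clip the gradients once from the combined loss, roughly like this (a sketch that assumes the hypothetical merged Estimator above and a clip threshold of 5.0, not the repository's current API):

import tensorflow as tf

def make_train_op(estimator, clip_norm=5.0):
    # estimator: an instance of the merged Estimator sketched above (hypothetical)
    grads, variables = zip(*estimator.grads_and_vars)
    # a single clipping step over the gradients of the combined loss
    clipped, _ = tf.clip_by_global_norm(list(grads), clip_norm)
    return estimator.optimizer.apply_gradients(list(zip(clipped, variables)))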