Open poweic opened 7 years ago
Hmm, that's a good point. I am not sure what exactly is happening under the hood in TensorFlow here, but I would imagine that most of the time the gradients are well within the boundary and this shouldn't have much of an adverse effect. But I think you're right, this may not be 100% correct.
It seems a bit ugly to fix this. I guess you would need to combine the two train ops by iterating through all gradients, adding them up, and then clipping only the shared ones?
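That step could look roughly like the sketch below. This is only a minimal illustration, assuming `tf.clip_by_norm` with a threshold of 5.0 and two `grads_and_vars` lists coming from `compute_gradients` on each loss; none of the names are from the repo.

```python
import tensorflow as tf

def merge_and_clip(policy_grads_and_vars, value_grads_and_vars, clip_norm=5.0):
    """Sum gradients per variable across both lists, then clip each sum once."""
    summed = {}  # variable -> accumulated gradient
    for grad, var in policy_grads_and_vars + value_grads_and_vars:
        if grad is None:
            continue
        summed[var] = grad if var not in summed else summed[var] + grad
    # Shared variables end up clipped on their *net* gradient,
    # instead of being clipped separately per loss.
    return [(tf.clip_by_norm(grad, clip_norm), var) for var, grad in summed.items()]
```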
@dennybritz I totally agree. It could be ugly, but maybe it's a chance to refactor. Sorry to get back to you so late; I was busy implementing ACER, something like an off-policy version of A3C.
I think one way to do this is to add up the losses from the policy net and the value net first, then compute the gradients, and then clip them. I guess that requires lots of changes to the whole architecture, because PolicyEstimator and ValueEstimator are currently separate classes.
My suggestion is that we merge PolicyEstimator and ValueEstimator into a single class, something like this:
```python
def build_shared_network(input):
    ...
    return shared

def policy_network(shared):
    ...
    return mu, sigma

def value_network(shared):
    ...
    return logits

class Estimator():
    def __init__(self, ...):
        ...
        shared = build_shared_network(...)
        mu, sigma = policy_network(shared)
        logits = value_network(shared)

        self.pi_loss = ...
        self.vf_loss = ...
        self.loss = self.pi_loss + self.vf_loss - entropy

        if trainable:
            self.optimizer = ...
            self.grads_and_vars = self.optimizer.compute_gradients(self.loss)
```
This has several advantages. For example, there is only one optimizer to set up:

```python
if trainable:
    self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)
    ...
```

and in worker.py the two train ops collapse into one:

```python
net_train_op = make_train_op(self.net, self.global_net)
# self.vnet_train_op = make_train_op(self.value_net, self.global_value_net)
# self.pnet_train_op = make_train_op(self.policy_net, self.global_policy_net)
```

But this is a big change and I'm not sure whether that's a good idea.
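For what it's worth, here is a minimal sketch of the clip-once step this buys us, assuming gradients are clipped by global norm at 5.0; the helper name and the threshold are illustrative, not the repo's.

```python
import tensorflow as tf

def clipped_grads_and_vars(optimizer, loss, clip_norm=5.0):
    """Differentiate the combined loss once, then clip by global norm once."""
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*[(g, v) for g, v in grads_and_vars if g is not None])
    clipped, _ = tf.clip_by_global_norm(list(grads), clip_norm)
    return list(zip(clipped, variables))

# In the merged Estimator this would replace the separate per-network clipping,
# e.g. self.grads_and_vars = clipped_grads_and_vars(self.optimizer, self.loss),
# and the single net_train_op would just apply these already-clipped gradients.
```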
Hi there,
I noticed that even though the policy net and value net share some parameters (in a3c/estimators.py), their gradients are clipped separately (in a3c/worker.py).
I was wondering whether that could be a problem (clip before add vs. clip after add).
Suppose we clip gradients by norm at a threshold of 5 (to keep it simple, I choose `clip_by_norm` instead of `clip_by_global_norm`). If, for some shared parameter, the gradient from the policy net is `+10` and the gradient from the value net is `-7`, the net gradient should be `+10 - 7 = +3` (no clipping needed). But if we clip before summing them up, it becomes `+5 - 5 = 0`.

Thanks
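Here is a quick numeric check of that example (TF 1.x API; the threshold of 5 and the +10 / -7 gradients are the ones described above):

```python
import tensorflow as tf

g_policy = tf.constant(10.0)   # gradient from the policy net (shared parameter)
g_value = tf.constant(-7.0)    # gradient from the value net (same parameter)

clip_after_add = tf.clip_by_norm(g_policy + g_value, 5.0)   # clip the summed gradient
clip_before_add = (tf.clip_by_norm(g_policy, 5.0) +
                   tf.clip_by_norm(g_value, 5.0))            # clip each, then sum

with tf.Session() as sess:
    print(sess.run([clip_after_add, clip_before_add]))       # [3.0, 0.0]
```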