LantaoYu / SeqGAN

Implementation of Sequence Generative Adversarial Nets with Policy Gradient

why should update rollout policy in this way? #29

Closed vanpersie32 closed 6 years ago

vanpersie32 commented 7 years ago

According to the paper, the rollout policy is the same as the generator policy, so it should be self.Wi = self.lstm.Wi. But in the code, the parameters of the rollout policy are updated here in a different way. Can you please explain why? Thank you very much @LantaoYu @wnzhang

eduOS commented 6 years ago

I am also wondering why there should be a delay. But to make it the same as in the paper, you can just set the update_rate to 1.

zichaow commented 6 years ago

I also noticed this; the update for the rollout seems to take the form of a convex combination of the parameters of the rollout and the generator. I wonder what the justification is for such an update.

gcbanana commented 6 years ago

@eduOS To make it the same as in the paper, why set the update_rate to 1? Shouldn't it be set to 0? The update is self.Wi = self.update_rate * self.Wi + (1 - self.update_rate) * tf.identity(self.lstm.Wi). After one training step of the generator, lstm.Wi has changed, but self.Wi has not. If the rate is set to 1, then self.Wi = self.Wi and it never changes. This confuses me.
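
For what it's worth, a quick standalone check (plain NumPy with illustrative names, not the actual SeqGAN graph code) shows how the convex-combination update behaves at the extreme values of update_rate:

```python
# Toy check of the convex-combination update discussed above.
# Names are illustrative; this is not the SeqGAN rollout code itself.
import numpy as np

def update(rollout_w, generator_w, update_rate):
    # Convex combination of the old rollout weights and the current generator weights.
    return update_rate * rollout_w + (1 - update_rate) * generator_w

rollout_w, generator_w = np.array([0.0]), np.array([5.0])

print(update(rollout_w, generator_w, update_rate=0.0))  # [5.] -> exact copy of the generator
print(update(rollout_w, generator_w, update_rate=1.0))  # [0.] -> rollout weights never change
print(update(rollout_w, generator_w, update_rate=0.8))  # [1.] -> rollout drifts slowly toward the generator
```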

eduOS commented 6 years ago

@gcbanana You are right. @vanpersie32 I learned that this trick is a regularization method, the so-called weight decay. Please see this: #21

lucaslingle commented 6 years ago

I had the same question.

I don't think that this is weight decay, because it's not being applied to the gradients, and it's not decaying the rollout network's weights towards zero. Rather, it's updating them in a way that maintains an exponential moving average of the generator network weights.

I recently found a reinforcement learning paper which did the same thing, in a different context. They said it improved the stability of the learning process.

In their case, they weren't using a rollout network, but the motivation here may be similar.

References:
[1] https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage
[2] https://arxiv.org/pdf/1509.02971.pdf
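
A minimal sketch of that kind of update, assuming a DDPG-style soft/EMA parameter update with illustrative names (not the SeqGAN code itself):

```python
# Sketch of an exponential-moving-average (soft) update of one network's
# parameters toward another's, as described above. Names are illustrative.
import numpy as np

def ema_update(rollout_params, generator_params, update_rate=0.8):
    """Move each rollout parameter toward the corresponding generator
    parameter, keeping an exponential moving average of generator weights."""
    return [update_rate * r + (1.0 - update_rate) * g
            for r, g in zip(rollout_params, generator_params)]

# Toy example: the rollout weights lag behind the generator weights,
# smoothing out abrupt changes between training steps.
rollout = [np.zeros((2, 2))]
for step in range(5):
    generator = [np.full((2, 2), float(step + 1))]  # pretend the generator keeps changing
    rollout = ema_update(rollout, generator)
    print(step, rollout[0][0, 0])
```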

vanpersie32 commented 6 years ago

@lucaslingle You are right. Closing the issue.

vanpersie32 commented 6 years ago

This is a trick for stabilizing the training process; setting the rollout parameters to be the same as the generator's degrades the performance of SeqGAN.

eduOS commented 6 years ago

In fact it is the same as L2 regularization. It keeps the weights small and hence stable, as stated in other comments.