LantaoYu / SeqGAN

Implementation of Sequence Generative Adversarial Nets with Policy Gradient

about the loss of generator #20

Closed wjb123 closed 7 years ago

wjb123 commented 7 years ago

Hi, I have read your code in generator.py (lines 106-113):

    # Unsupervised Training

    self.g_loss = -tf.reduce_sum(
        tf.reduce_sum(
            tf.one_hot(tf.to_int32(tf.reshape(self.x, [-1])), self.num_emb, 1.0, 0.0) * tf.log(
                tf.clip_by_value(tf.reshape(self.g_predictions, [-1, self.num_emb]), 1e-20, 1.0)
            ), 1) * tf.reshape(self.rewards, [-1])
    )

I find that the variables (self.g_predictions and self.x) are the same as the ones used in self.pretrain_loss, but since this is for the Unsupervised Training, it seems self.g_predictions should be replaced with the variable from line 50, tf.nn.softmax(o_t). After all, lines 47-56 do not use the supervised information (self.x). Is there a reason for this?
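For reference, here is a toy NumPy sketch of what the quoted loss computes (shapes and values are made up, not taken from generator.py): for each position it takes the log-probability the generator assigned to the token actually present in the fed-in sequence (via the one-hot trick), weights it by that position's reward, and sums with a minus sign.

    import numpy as np

    batch_size, seq_len, num_emb = 1, 3, 4
    g_predictions = np.full((batch_size, seq_len, num_emb), 0.25)  # per-step softmax outputs
    x = np.array([[2, 0, 3]])                                      # the fed-in token ids
    rewards = np.array([[0.9, 0.5, 0.1]])                          # per-token rewards from D

    flat_p = g_predictions.reshape(-1, num_emb)
    one_hot = np.eye(num_emb)[x.reshape(-1)]                       # pick out the taken tokens
    log_p = np.sum(one_hot * np.log(np.clip(flat_p, 1e-20, 1.0)), axis=1)
    g_loss = -np.sum(log_p * rewards.reshape(-1))
    print(g_loss)  # -(0.9 + 0.5 + 0.1) * log(0.25) ~= 2.079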

LantaoYu commented 7 years ago

Hi, this part is a little tricky. In REINFORCE, we first sample a trajectory according to the policy here and then send the sequence to the discriminator to get a reward here. After that, I need to feed the reward signal back to the generator, and to reproduce the probability distribution at each step of generating that sequence (since the received reward is only for that specific sequence), I feed that sequence back in a supervised style. For example, suppose I get a reward Q(e|{a,b,c,d}) and have to reproduce G(e|{a,b,c,d}); then at each time step I should feed in {a,b,c,d}.
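To make this two-pass scheme concrete, here is a toy sketch in NumPy (the tiny one-step model and all names are hypothetical, not from the repo). It first rolls the model out freely, sampling each next token from its own output distribution, and then re-feeds that same sampled sequence in a supervised, teacher-forced way. Assuming the parameters have not been updated between the two passes, the second pass reproduces exactly the per-step distributions that were used while sampling, which is what the reward-weighted loss above needs.

    import numpy as np

    rng = np.random.default_rng(0)
    num_emb, seq_len = 5, 4
    W = rng.normal(size=(num_emb, num_emb))          # toy "generator" parameters

    def step_probs(prev_token):
        """Next-token distribution given the previous token (toy stand-in for the RNN step)."""
        logits = W[prev_token]
        e = np.exp(logits - logits.max())
        return e / e.sum()

    # Pass 1: free-running roll-out -- condition each step on the model's own sample.
    tok, sampled, rollout_probs = 0, [], []
    for _ in range(seq_len):
        p = step_probs(tok)
        tok = rng.choice(num_emb, p=p)
        sampled.append(tok)
        rollout_probs.append(p)
    # (In SeqGAN the sampled sequence would now be scored by the discriminator.)

    # Pass 2: teacher-forced -- condition each step on the given sequence x,
    # which is what feeding self.x back into the generator does.
    x, tok, forced_probs = sampled, 0, []
    for t in range(seq_len):
        forced_probs.append(step_probs(tok))
        tok = x[t]

    # Because x is the very sequence that was sampled (and the parameters are
    # unchanged), both passes yield the same per-step distributions G(x_t | x_<t).
    assert all(np.allclose(a, b) for a, b in zip(rollout_probs, forced_probs))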

yooceii commented 4 years ago

Hi, I have a follow-up question. When we sample the trajectory, don't we obtain the state-action transition probabilities at the same time? Why do you have to reproduce the probabilities later?