wjb123 closed this issue 7 years ago
Hi, this part is a little tricky. In REINFORCE, we first sample a trajectory according to the policy here, and then send that sequence to the discriminator to get a reward here. After this, I need to feed the reward signal back to the generator. Since the received reward is only for that specific sequence, I have to reproduce the probability distribution at each step of generating it, so I feed that sequence back in a supervised style. For example, suppose I get a reward Q(e|{a,b,c,d}); to reproduce G(e|{a,b,c,d}), at each time step I should feed in {a,b,c,d}.
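The two-pass pattern described above can be sketched in plain numpy. This is a hypothetical toy, not the repository's code: `theta` stands in for the generator's parameters, the steps are independent (no recurrence, so the "re-feed" pass is trivial here; with a real LSTM you must re-run the network on the sampled prefix to recover each step's distribution), and `reward` is a placeholder for the discriminator's score Q.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 5, 4

# Toy "policy": per-step logits over the vocabulary (a stand-in for the
# generator's LSTM outputs; each step is independent for simplicity).
theta = rng.normal(size=(seq_len, vocab_size))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Pass 1: sample a trajectory from the policy.
probs = softmax(theta)
tokens = [rng.choice(vocab_size, p=probs[t]) for t in range(seq_len)]

# The discriminator scores the whole sequence (placeholder value here).
reward = 0.8

# Pass 2: recover G(token_t | prefix) for every step of the sampled
# sequence and form the REINFORCE gradient, grad log pi(a_t) * Q.
# With a recurrent generator this pass re-feeds the sampled sequence;
# here the steps are independent, so probs can be reused directly.
grad = np.zeros_like(theta)
for t, a in enumerate(tokens):
    one_hot = np.eye(vocab_size)[a]
    grad[t] = (one_hot - probs[t]) * reward  # d/dtheta log softmax(theta[t])[a] * Q

theta += 0.1 * grad  # ascend the expected reward
```

After the update, the probability of each sampled token goes up, which is exactly the effect of weighting the supervised-style log-likelihood by the reward.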
Hi, I have a follow-up question. When we sample the trajectory, don't we obtain the state-action transition probabilities at the same time? Why do you have to reproduce them later?
Hi, I have read your code in generator.py (lines 106-113).
Unsupervised Training
I find that the variables (self.g_predictions and self.x) are the same as the variables in self.pretrain_loss. But since this is for the unsupervised training, shouldn't self.g_predictions be replaced with the variable at line 50, tf.nn.softmax(o_t)? After all, lines 47-56 do not use the supervised information (self.x). Is there a reason for this?
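The distinction the question hinges on can be illustrated with a toy model. Below, `W` and `step_logits` are hypothetical stand-ins for the LSTM cell (not the repository's code): `teacher_probs` conditions each step on the *given* previous token, the way self.g_predictions does when the sampled sequence is fed back through self.x, while `free_probs` conditions on the model's own previous output, the way the free-running tf.nn.softmax(o_t) path does. Once the two prefixes diverge, the per-step probabilities differ, which is why the supervised-style re-feed is needed to reproduce G of the specific sampled sequence.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = 4
# Hypothetical one-step model: the next-token logits favour repeating
# the previous token (chosen so the two paths visibly diverge).
W = 2.0 * np.eye(vocab)

def step_logits(prev_token):
    return W[prev_token]

given_seq = [1, 2, 0]   # a sampled sequence fed back through self.x
start = 0               # start token

# Teacher forcing (what self.g_predictions computes): step t is
# conditioned on the given token at t-1, so the product of these
# per-step probabilities reproduces G(given_seq).
teacher_probs, prev = [], start
for tok in given_seq:
    p = softmax(step_logits(prev))
    teacher_probs.append(p[tok])
    prev = tok                      # feed the given token

# Free running (what tf.nn.softmax(o_t) in lines 47-56 yields): step t
# is conditioned on the model's own previous output, which walks a
# different prefix in general.
free_probs, prev = [], start
for tok in given_seq:
    p = softmax(step_logits(prev))
    free_probs.append(p[tok])
    prev = int(np.argmax(p))        # the model's own choice

# Step 1 matches (same start prefix); by step 3 the prefixes have
# diverged, so the free-running probability no longer equals the
# teacher-forced one for the same token.
```

So using softmax(o_t) in the unsupervised loss would weight the reward by probabilities of a trajectory other than the one the discriminator actually scored.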