facebookarchive / MIXER

Mixed Incremental Cross-Entropy REINFORCE ICLR 2016

Question regarding gradient from Reinforce Criterion #4

Closed · lifelongeek closed this 8 years ago

lifelongeek commented 8 years ago

First of all, thanks for sharing the source code for your awesome work. I am trying to apply the MIXER objective function to my model.

I have two questions about the output gradient formula from the Reinforce Criterion (formula (11) in the paper: http://arxiv.org/abs/1511.06732).

Question 1) To understand the full derivation of the gradient, the paper recommends reading "Reinforcement Learning Neural Turing Machines". Can you clarify which formulas in the reference paper (http://arxiv.org/abs/1505.00521) correspond to the derivation of the gradient?

Question 2) As far as I understand, 'T' in formula (11) is the length of the sequence generated by the RNN (e.g. the number of tokens the RNN generates until it outputs the end-of-sentence token <eos>). When t = T in formula (11), how can we calculate r(T+1)? Here is my guess: r(T+1) comes at the time step where the input of the RNN is <eos>. Is this right?

I really appreciate your help. Thank you :)

ranzato commented 8 years ago

1) I was referring to the formulas in the middle of page 5 and page 12 of http://arxiv.org/pdf/1505.00521v3.pdf, which essentially take the gradient w.r.t. the expected reward.

2) Yes, T is the length of the generated sequence. I assume that the model takes T inputs (w_1 ... w_T) and predicts T outputs (w_2 ... w_{T+1}), where w_{T+1} is the special end-of-sentence token. I realize that the notation may be confusing; hopefully the code clarifies these boundary conditions.
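For concreteness, here is a minimal sketch in Python (the repo itself is Torch/Lua, so this is only an illustration of the convention described above, not the actual code; the tokens are made up) showing how inputs, targets, and the final reward line up:

```python
# Illustrative only: how a generated sequence of length T is split into
# inputs and targets under the convention above.
tokens = ["<bos>", "the", "cat", "sat", "<eos>"]  # hypothetical sequence

inputs  = tokens[:-1]   # w_1 ... w_T      -> ["<bos>", "the", "cat", "sat"]
targets = tokens[1:]    # w_2 ... w_{T+1}  -> ["the", "cat", "sat", "<eos>"]
T = len(inputs)         # T = 4

for t, (inp, tgt) in enumerate(zip(inputs, targets), start=1):
    print(f"t={t}: input={inp!r} -> target={tgt!r}")

# At t = T the model predicts w_{T+1} = "<eos>", and that final step is
# where the sequence-level reward (e.g. sentence BLEU) is observed --
# this is what the index T+1 on r refers to in the paper's notation.
```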

eriche2016 commented 8 years ago

Hi, I also have a question about the equation (11) part. You say that when r > \bar{r}_{t+1}, the model is encouraged to select w_{t+1}^g. What does this mean? Can you give me some intuition for this?

ranzato commented 8 years ago

The math says that the baseline term \bar{r}_{t+1} in that equation is there to reduce the variance of the estimator of the true gradient (and the variance is large, since we only sample one trajectory).

In addition to that, my interpretation is as follows: you can see the scalar (r - \bar{r}_{t+1}) as changing the sign of the (usual) cross-entropy gradient. In particular, if you observed a reward r greater than what you expected, then the sign is positive. This means that you reinforce the choice of words you made. Vice versa, if the reward is lower than what you expected, then the sign is negative, thereby discouraging the model from making that choice of words again. The expected reward (which we train as an additional "head" of our RNN) takes the role of the "critic": a baseline you want to beat, but note that this baseline keeps changing as you train your model.
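To make that interpretation concrete, here is a minimal NumPy sketch (not the repo's Torch/Lua ReinforceCriterion; the reward and baseline values are made up) of a per-step gradient of the form (r - \bar{r}_{t+1}) times the usual cross-entropy gradient:

```python
import numpy as np

vocab_size = 5
rng = np.random.default_rng(0)

# Unnormalized scores produced by the RNN at step t, and the softmax over them.
scores = rng.standard_normal(vocab_size)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# w_{t+1}^g: the word the model actually sampled at this step.
sampled_word = rng.choice(vocab_size, p=probs)
one_hot = np.zeros(vocab_size)
one_hot[sampled_word] = 1.0

r = 0.7          # observed sequence-level reward (e.g. BLEU) -- made-up value
baseline = 0.5   # \bar{r}_{t+1}: reward predicted by the extra "head" -- made-up value

# Cross-entropy-shaped gradient w.r.t. the scores, scaled by (r - baseline):
# a positive scale (r > baseline) pushes probability toward the sampled word,
# a negative scale (r < baseline) pushes it away.
grad_scores = (r - baseline) * (probs - one_hot)
print(grad_scores)
```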