About the "baseline" of REINFORCE in RAM

Hi @nicholas-leonard,

I am Ray, and I am currently looking into your code recurrent-visual-attention.lua. It is nice work! I am a little confused about this part:

-- add the baseline reward predictor
seq = nn.Sequential()
seq:add(nn.Constant(1,1))
seq:add(nn.Add(1))
concat = nn.ConcatTable():add(nn.Identity()):add(seq)
concat2 = nn.ConcatTable():add(nn.Identity()):add(concat)

It seems that in your implementation the baseline is a constant value (nn.Constant) plus a bias (nn.Add). But on page 5 of the paper "Recurrent Models of Visual Attention", the authors state that the baseline is "b_t = E_pi[R_t]". This is different from your code, right? Could you please explain this difference?
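To make the comparison concrete, here is how I read the two definitions side by side. The symbol \beta (the scalar bias held by nn.Add(1)) and the batch size N are my own notation, and the squared-error training of the bias is an assumption on my part, not something I have confirmed in the code:

```latex
% Baseline as I read it from the code: a constant 1 plus one scalar bias,
% independent of the input and of the time step t.
b_{\text{code}} = 1 + \beta, \qquad \beta \in \mathbb{R}

% Baseline as stated in the paper.
b_t = \mathbb{E}_{\pi}[R_t]

% ASSUMPTION: \beta is fit by regressing onto the observed rewards R^{(n)}
% with a squared-error loss over N samples.
L(\beta) = \frac{1}{N} \sum_{n=1}^{N} \bigl( R^{(n)} - (1 + \beta) \bigr)^2

% Setting the derivative to zero gives the sample mean of the rewards.
\frac{\partial L}{\partial \beta}
  = -\frac{2}{N} \sum_{n=1}^{N} \bigl( R^{(n)} - 1 - \beta \bigr) = 0
\;\Longrightarrow\;
1 + \beta^{*} = \frac{1}{N} \sum_{n=1}^{N} R^{(n)}
```

If that assumption holds, the learned scalar would approach the average reward, which is a sample estimate of an expected reward that does not depend on t or on the input. I am not sure whether this is the intended reading.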

Thank you!

Best, Ray
