Closed JoostvDoorn closed 5 years ago
As mentioned in your paper, the cumulative future reward is used to update the policy at timestep t. My understanding is that this is done at the following line inside the compute_sentence_bleu function:
https://github.com/apeterswu/RL4NMT/blob/2c8741c4088e6b1da893e36a305a8598fd16659e/tensor2tensor/utils/bleu_hook.py#L226
It seems that delta_results[::-1] reverses the batch dimension instead of the time dimension. Shouldn't this be:

delta_results = delta_results[:, ::-1].cumsum(axis=1)[:, ::-1]
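A minimal NumPy sketch of the difference, using dummy data (not the repo's actual tensors), where rows are sentences in the batch and columns are timesteps:

```python
import numpy as np

# Dummy per-timestep reward deltas: batch of 2 sentences, 4 timesteps each.
delta_results = np.array([[1., 2., 3., 4.],
                          [5., 6., 7., 8.]])

# delta_results[::-1] flips axis 0 (the batch), so the cumulative sum
# accumulates rewards across different sentences, not across time.
wrong = np.cumsum(delta_results[::-1], axis=0)[::-1]

# Reversing axis 1 (time) before the cumulative sum gives, at each
# timestep t, the sum of rewards from t to the end of the sentence.
right = delta_results[:, ::-1].cumsum(axis=1)[:, ::-1]

print(right)
# [[10.  9.  7.  4.]
#  [26. 21. 15.  8.]]
# e.g. row 0, t=0: 1+2+3+4 = 10, the reward-to-go from timestep 0.
```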
Thanks for your detailed checking. I will check my running code on the server and may update the results.
Thanks again. After fixing this error, the results show no significant difference.