apeterswu / RL4NMT

Reinforcement Learning for Neural Machine Translation
187 stars 48 forks source link

Question about the cumulative reward #3

Closed JoostvDoorn closed 5 years ago

JoostvDoorn commented 6 years ago

As mentioned in your paper the cumulative future reward is used to update the policy at timestep t. My understand is that this is done at the following line inside the compute_sentence_bleu function.

https://github.com/apeterswu/RL4NMT/blob/2c8741c4088e6b1da893e36a305a8598fd16659e/tensor2tensor/utils/bleu_hook.py#L226

It seems that delta_results[::-1] reverses the batch dimension instead of the time dimension. Shouldn't this be?:

delta_results = delta_results[:, ::-1].cumsum(axis=1)[:, ::-1]
apeterswu commented 6 years ago

Thanks for your detailed checking, I will have a check on my running code on the server. May update the results.


Thanks again. When fixing this error, the results seem to have no big difference.