Closed JoostvDoorn closed 5 years ago
As mentioned in your paper, the cumulative future reward is used to update the policy at timestep t. My understanding is that this is done at the following line inside the compute_sentence_bleu function:
https://github.com/apeterswu/RL4NMT/blob/2c8741c4088e6b1da893e36a305a8598fd16659e/tensor2tensor/utils/bleu_hook.py#L226
It seems that delta_results[::-1] reverses the batch dimension instead of the time dimension. Shouldn't this be:

delta_results = delta_results[:, ::-1].cumsum(axis=1)[:, ::-1]
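A minimal NumPy sketch of the difference, using dummy data (not the repo's actual tensors), where rows are sentences in the batch and columns are timesteps:

```python
import numpy as np

# Dummy per-timestep reward deltas: batch of 2 sentences, 4 timesteps each.
delta_results = np.array([[1., 2., 3., 4.],
                          [5., 6., 7., 8.]])

# delta_results[::-1] flips axis 0 (the batch), so the cumulative sum
# accumulates rewards across different sentences, not across time.
wrong = np.cumsum(delta_results[::-1], axis=0)[::-1]

# Reversing axis 1 (time) before the cumulative sum gives, at each
# timestep t, the sum of rewards from t to the end of the sentence.
right = delta_results[:, ::-1].cumsum(axis=1)[:, ::-1]

print(right)
# [[10.  9.  7.  4.]
#  [26. 21. 15.  8.]]
# e.g. row 0, t=0: 1+2+3+4 = 10, the reward-to-go from timestep 0.
```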
Thanks for your detailed checking. I will check my running code on the server and may update the results.
Thanks again. After fixing this error, the results show no significant difference.