Closed · gunnxx closed this issue 1 year ago
Hi, running your example: for index i, we would get reward i, nobs i, obs i-1, and act i. Note that the observations are also stored off-by-one: https://github.com/conglu1997/v-d4rl/blob/29d0960923b634a0b149d4312e18460d49fbeb08/drqbc/numpy_replay_buffer.py#L53.
Oh right, my bad. I misunderstood the ordering of the action and reward as well (they are off-by-one). Thanks for the clarification!
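To make that convention concrete, here is a minimal toy sketch (an illustration only, not the actual `numpy_replay_buffer.py` code) of a buffer where the action taken, the reward received, and the resulting observation all land at the same index, so sampling index i yields (obs i-1, act i, reward i, nobs i):

```python
import numpy as np

class ToyBuffer:
    """Toy illustration of the off-by-one storage convention (not the v-d4rl buffer)."""

    def __init__(self, size):
        self.obs = np.zeros(size)
        self.act = np.zeros(size)
        self.rew = np.zeros(size)
        self.idx = 0

    def add(self, obs, act, rew):
        # obs is the observation *resulting from* taking act and receiving rew,
        # so all three are written to the same index.
        self.obs[self.idx] = obs
        self.act[self.idx] = act
        self.rew[self.idx] = rew
        self.idx += 1

    def sample(self, i):
        # A 1-step transition anchored at index i:
        # (obs i-1, act i, reward i, nobs i), matching the reply above.
        return self.obs[i - 1], self.act[i], self.rew[i], self.obs[i]

buf = ToyBuffer(3)
buf.add(obs=0.0, act=np.nan, rew=np.nan)  # first observation of the episode
buf.add(obs=1.0, act=1.0, rew=0.5)        # took act=1.0, received rew=0.5, saw obs=1.0
print(buf.sample(1))  # (0.0, 1.0, 0.5, 1.0), i.e. (s, a, r, s')
```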
Hi,
Upon reading the code, I think the n-step reward computation is wrong. For example, take `self.frame_stack = 1` and `self.nstep = 1`, and let's say `indices[0] = 1`. Supposing the experiences are written as `(s, a, r, s', a', r', s'')`, the sampled experience will then be `(s, a, r', s')` instead of `(s, a, r, s')`. The fix will be [...]

What do you think? Did I miss something?
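For completeness, here is a simplified sketch of the gather logic applied to this exact example (hypothetical code, not the repository's implementation), assuming the off-by-one storage convention described in the reply above; under that convention the sample comes out as `(s, a, r, s')` rather than `(s, a, r', s')`:

```python
import numpy as np

# Hypothetical flat buffers for the episode s --a,r--> s' --a',r'--> s'',
# stored with the off-by-one convention (obs[i] follows act[i] / rew[i]).
obs = np.array(["s", "s'", "s''"])
act = np.array(["-", "a", "a'"])
rew = np.array(["-", "r", "r'"])

frame_stack, nstep = 1, 1
index = 1  # indices[0] = 1 from the example

# Gather ranges in the style of an n-step, frame-stacked sampler:
all_range = np.arange(index - frame_stack, index + nstep)  # [0, 1]
obs_range = all_range[:frame_stack]     # [0] -> current observation
rew_range = all_range[frame_stack:]     # [1] -> rewards to accumulate
nobs_range = all_range[-frame_stack:]   # [1] -> next observation

print(obs[obs_range], act[index], rew[rew_range], obs[nobs_range])
# ['s'] a ['r'] ["s'"]  ->  (s, a, r, s')
```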