conglu1997 / v-d4rl

Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations
MIT License

Potential bug in `EfficientReplayBuffer` #8

Closed: gunnxx closed this issue 1 year ago

gunnxx commented 1 year ago

Hi,

Upon reading the code, I think the n-step reward computation is wrong, i.e.:

def gather_nstep_indices(self, indices):
    n_samples = indices.shape[0]
    # For each sampled index, gather frame_stack slots before it (for the stacked
    # observation) and nstep slots from it onwards (for the n-step return).
    all_gather_ranges = np.stack([np.arange(indices[i] - self.frame_stack, indices[i] + self.nstep)
                                    for i in range(n_samples)], axis=0) % self.buffer_size
    gather_ranges = all_gather_ranges[:, self.frame_stack:]  # bs x nstep, slots read for rewards
    obs_gather_ranges = all_gather_ranges[:, :self.frame_stack]  # slots read for the current stacked obs
    nobs_gather_ranges = all_gather_ranges[:, -self.frame_stack:]  # slots read for the next stacked obs

    all_rewards = self.rew[gather_ranges]

    # Could implement below operation as a matmul in pytorch for marginal additional speed improvement
    rew = np.sum(all_rewards * self.discount_vec, axis=1, keepdims=True)

If we take an example with self.frame_stack=1 and self.nstep=1, and let's say indices[0] = 1: supposing the experiences are written in order as (s, a, r, s', a', r', s''), then the sampled experience will be (s, a, r', s') instead of (s, a, r, s'). The fix would be

gather_ranges = all_gather_ranges[:, self.frame_stack-1:-1]  # bs x nstep
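
For concreteness, here is a quick trace of the index arithmetic in this example (a minimal sketch; frame_stack, nstep, and the index are the values assumed above, while buffer_size is an arbitrary placeholder):

import numpy as np

# Hypothetical settings matching the example: frame_stack=1, nstep=1, indices[0]=1.
frame_stack, nstep, buffer_size = 1, 1, 8
index = 1

all_gather_ranges = (np.arange(index - frame_stack, index + nstep) % buffer_size)[None, :]
print(all_gather_ranges)                    # [[0 1]]
print(all_gather_ranges[:, frame_stack:])   # [[1]]  -> the reward is read from slot 1
print(all_gather_ranges[:, :frame_stack])   # [[0]]  -> the current obs is read from slot 0
print(all_gather_ranges[:, -frame_stack:])  # [[1]]  -> the next obs is read from slot 1

So whether slot 1 holds r or r' comes down to how the rewards are aligned with the observations when they are written into the buffer.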

What do you think? Did I miss something?

conglu1997 commented 1 year ago

Hi, running your example: for index i, we would get reward i, nobs i, obs i-1, and act i. Note that the observations are also stored off-by-one: https://github.com/conglu1997/v-d4rl/blob/29d0960923b634a0b149d4312e18460d49fbeb08/drqbc/numpy_replay_buffer.py#L53.
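
In other words, a minimal sketch of the write convention being described here (a hypothetical toy buffer, not the repo's actual EfficientReplayBuffer code):

import numpy as np

class TinyBuffer:
    """Toy buffer illustrating the off-by-one storage convention."""
    def __init__(self, size, obs_dim):
        self.obs = np.zeros((size, obs_dim))
        self.act = np.zeros((size, 1))
        self.rew = np.zeros((size, 1))
        self.size = size
        self.i = 0

    def add(self, action, reward, next_obs):
        # Slot i stores the action, the reward, and the *resulting* observation of
        # step i; the observation the action was taken from already sits in slot i - 1.
        self.obs[self.i] = next_obs
        self.act[self.i] = action
        self.rew[self.i] = reward
        self.i = (self.i + 1) % self.size

With that convention, reading obs[i-1], act[i], rew[i], and obs[i] for a sampled index i reconstructs exactly (s, a, r, s').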

gunnxx commented 1 year ago

Oh right, my bad. I misunderstood the ordering of the actions and rewards as well (they are off-by-one). Thanks for the clarification!