Hi, SaLinA considers that the reward at t-1 is provided to the agent 'inside' the observation at time t, thus generating this effect. From an 'agent' point of view, it seems quite intuitive: the agent reads its input at time t-1 and produces its output at time t. But I agree that the other choice could have been made. The critical point is certainly to make this appear in the GymAgent documentation.
Anyway, changing this implementation aspect would impact almost all the provided algorithms :(
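To make the convention concrete, here is a minimal sketch (not the actual GymAgent code; it assumes the classic gym reset/step API with a 4-tuple return): the reward returned by `env.step()` at time t is stored together with the *next* observation, and index 0 holds the default reward of 0.

```python
# Sketch of the convention described above, not the GymAgent implementation.
# Assumes the classic gym API where step() returns (obs, reward, done, info).
import gym

env = gym.make("CartPole-v1")
obs = env.reset()

observations, rewards = [obs], [0.0]   # reward at t=0 defaults to 0
for t in range(5):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    observations.append(obs)           # observation at t+1
    rewards.append(reward)             # reward "read" by the agent at t+1
    if done:
        break
```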
Hey there, I find it somewhat counterintuitive that the framework uses a default reward of 0 at t=0 (see gyma.py lines 279 & 292). Note that the gym interface only returns the initial state on reset (https://github.com/openai/gym/blob/103b7633f564a60062a25cc640ed1e189e99ddb7/gym/core.py#L8). Isn't it more common to assume that r_t = R(s_t, a_t), so that r_t is the outcome of \pi(s_t)? Currently, r_{t+1} is the outcome of \pi(s_t). In the A2C example this leads to some confusion: reward[1:] is the reward at time t while critic[1:] is the state value at time t+1, even though both use the same index offset of 1. A sketch of what I mean follows below.
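Here is a hypothetical illustration of that indexing, assuming [T, B]-shaped workspace tensors named reward, done and critic as in the A2C example; this is not the actual SaLinA code, just a sketch of the time shift.

```python
# Hypothetical sketch of the A2C indexing under the "reward arrives with the
# next observation" convention; tensor names are assumptions, not SaLinA code.
import torch

T, B, gamma = 8, 4, 0.99
reward = torch.zeros(T, B)                    # reward[0] is the default 0 at t=0
done = torch.zeros(T, B, dtype=torch.bool)    # episode-termination flags
critic = torch.randn(T, B)                    # critic[t] = V(s_t)

# reward[1:] holds R(s_t, a_t) for t = 0..T-2, so the TD target for
# critic[:-1] combines reward[1:] with the bootstrapped value critic[1:].
target = reward[1:] + gamma * critic[1:] * (1 - done[1:].float())
td_error = target - critic[:-1]
```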
Best regards
edit: Fig. 13 & Fig. 14 in the ArXiv Paper use
set.get(...)
, i believe it should beself.get(...)
:-)