Hi, SaLinA considers that the reward at t-1 is provided to the agent 'inside' the observation at time t, thus generating this effect. From an 'agent' point of view, it seems quite intuitive: the agent reads its input at time t-1 and produces its output at time t. But I agree that the other choice could have been made. The critical point is certainly to make this appear in the GymAgent documentation.
Anyway, changing this implementation aspect would impact almost all the provided algorithms :(
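To make the convention concrete, here is a minimal sketch (not the actual GymAgent code; it assumes the classic gym reset/step API with a 4-tuple return): the reward returned by `env.step()` at time t is stored together with the *next* observation, and index 0 holds the default reward of 0.

```python
# Sketch of the convention described above, not the GymAgent implementation.
# Assumes the classic gym API where step() returns (obs, reward, done, info).
import gym

env = gym.make("CartPole-v1")
obs = env.reset()

observations, rewards = [obs], [0.0]   # reward at t=0 defaults to 0
for t in range(5):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    observations.append(obs)           # observation at t+1
    rewards.append(reward)             # reward "read" by the agent at t+1
    if done:
        break
```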
Hey there, I find it somewhat counterintuitive that the framework uses a default reward of 0 at t=0 (see gyma.py lines 279 & 292). Note that the gym interface only returns the initial state on reset (https://github.com/openai/gym/blob/103b7633f564a60062a25cc640ed1e189e99ddb7/gym/core.py#L8). Isn't it more common to assume that r_t = R(s_t, a_t), so that r_t is the outcome of \pi(s_t)? Currently, r_{t+1} is the outcome of \pi(s_t). In the A2C example this leads to some confusion: reward[1:] is the reward at time t while critic[1:] is the state value at time t+1, even though both use the same index offset of 1. A sketch of what I mean follows below.
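Here is a hypothetical illustration of that indexing, assuming [T, B]-shaped workspace tensors named reward, done and critic as in the A2C example; this is not the actual SaLinA code, just a sketch of the time shift.

```python
# Hypothetical sketch of the A2C indexing under the "reward arrives with the
# next observation" convention; tensor names are assumptions, not SaLinA code.
import torch

T, B, gamma = 8, 4, 0.99
reward = torch.zeros(T, B)                    # reward[0] is the default 0 at t=0
done = torch.zeros(T, B, dtype=torch.bool)    # episode-termination flags
critic = torch.randn(T, B)                    # critic[t] = V(s_t)

# reward[1:] holds R(s_t, a_t) for t = 0..T-2, so the TD target for
# critic[:-1] combines reward[1:] with the bootstrapped value critic[1:].
target = reward[1:] + gamma * critic[1:] * (1 - done[1:].float())
td_error = target - critic[:-1]
```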
Best regards
edit: Fig. 13 & Fig. 14 in the ArXiv Paper use
set.get(...)
, i believe it should beself.get(...)
:-)