Hi @courseprojects,

You're correct -- `reward()` only takes `action` explicitly, but because the `reward()` function is part of the environment, it already has access to the entire state -- so hopefully this shouldn't be a problem for you if you choose to extend the reward function. If you do end up creating a generalizable hindsight reward function, we'd be happy to merge it into our repo -- please see our contributions guide if so!
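For example, here is a rough sketch of one way such an extension could look, assuming a recent robosuite version (the `goal` argument and the cube lookups below are illustrative assumptions, not part of the current API):

```python
import numpy as np
from robosuite.environments.manipulation.lift import Lift  # import path may differ by version


class GoalConditionedLift(Lift):
    """Illustrative subclass whose reward can be re-evaluated for an arbitrary goal."""

    def reward(self, action=None, goal=None):
        # `goal` is a hypothetical extra argument used for hindsight relabeling;
        # with goal=None this behaves exactly like the standard Lift reward.
        if goal is None:
            return super().reward(action)

        # The environment already holds the full state, so we can read the achieved
        # cube position directly (attribute names assume the current Lift implementation).
        cube_pos = np.array(self.sim.data.body_xpos[self.cube_body_id])
        dist = np.linalg.norm(cube_pos - np.asarray(goal))

        # Same shaping as the reaching reward: non-negative distance -> reward in (0, 1].
        return 1 - np.tanh(10.0 * dist)
```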
You're also correct that `tanh` has a range of [-1, 1]. However, our distance values are always positive, so our effective range is [0, 1].
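As a quick sanity check on the range (a standalone snippet, not from the codebase):

```python
import numpy as np

# For non-negative distances, tanh(10 * dist) lies in [0, 1),
# so the reaching reward 1 - tanh(10 * dist) lies in (0, 1].
for dist in [0.0, 0.05, 0.5, 5.0]:
    print(dist, 1 - np.tanh(10.0 * dist))
# 0.0  -> 1.0
# 0.05 -> ~0.538
# 0.5  -> ~9.1e-05
# 5.0  -> ~0.0 (never negative)
```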
Hope that helps! Closing this issue for now, but feel free to re-open it if you feel that there are still some unanswered questions.
For hindsight experience replay use cases, one needs a way to compute the reward for the "hallucinated" new goal. This calculation needs not only the action but also the current state. In some of the `reward()` implementations I have seen (e.g. Lift), `reward()` only takes `action` as input. Could we have something similar to OpenAI's `compute_reward()` function?

Another issue is the sparse reward in Lift. The `reaching_reward` is computed as `1 - np.tanh(10.0 * dist)`, but this has range [0, 2] because the range of `tanh` is [-1, 1]. Should this be updated?
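For reference, the interface I have in mind follows the Gym `GoalEnv` convention, where the reward depends only on the achieved and desired goals, so HER can recompute it for relabeled goals. A rough sketch (the shaping and goal values are just for illustration):

```python
import numpy as np


def compute_reward(achieved_goal, desired_goal, info=None):
    """Gym GoalEnv-style reward: depends only on the goals, so HER can
    recompute it for relabeled ("hallucinated") goals after the fact."""
    dist = np.linalg.norm(achieved_goal - desired_goal)
    return 1 - np.tanh(10.0 * dist)  # same shaping as the Lift reaching reward


# HER relabeling: replace the stored transition's goal with a goal that was
# actually achieved later in the episode, then recompute its reward.
achieved = np.array([0.10, 0.02, 0.85])  # e.g. cube position that was reached
hindsight_goal = achieved.copy()         # "hallucinated" goal = what was achieved
print(compute_reward(achieved, hindsight_goal))  # -> 1.0
```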