Open · riiswa opened this issue 9 months ago
It seems correct to me, at least on the current `main` branch.
What may be confusing is that it does the +1 before calculating the bonus, but that is needed because the formula requires the count to start from 1. We could instead initialize `pre_count=1` and do the +1 after the calculation, but the outcome is the same.
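To illustrate the point about the +1, here is a minimal sketch comparing the two orderings. It assumes the bonus is 1/sqrt(n) of the visit count; the function names are hypothetical and this is not the wrapper's actual code.

```python
import math

def bonus_increment_first(counts: dict, key) -> float:
    # Hypothetical helper: unseen keys start at 0, increment first,
    # then compute the bonus from the updated count.
    pre_count = counts.get(key, 0)
    new_count = pre_count + 1
    counts[key] = new_count
    return 1 / math.sqrt(new_count)

def bonus_start_at_one(counts: dict, key) -> float:
    # Hypothetical helper: unseen keys start at 1, compute the bonus,
    # then increment afterwards.
    pre_count = counts.get(key, 1)
    bonus = 1 / math.sqrt(pre_count)
    counts[key] = pre_count + 1
    return bonus

keys = ["a", "a", "b", "a"]
c1, c2 = {}, {}
b1 = [bonus_increment_first(c1, k) for k in keys]
b2 = [bonus_start_at_one(c2, k) for k in keys]
print(b1 == b2)  # True: both yield 1, 1/sqrt(2), 1, 1/sqrt(3)
```

Only the stored counts differ (the second variant stores n + 1); the bonuses returned are identical.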
(Disclaimer: I'm not a maintainer of this repo)
`ActionBonus` is a wrapper that adds an exploration bonus to less-visited (state, action) pairs. Looking at the source code, we make the transition to the new state s_{t+1} before updating the counter of (s_{t+1}, a_t). Shouldn't we update the counter of (s_t, a_t) instead (i.e. perform the step in the environment after the update)? I may have misunderstood the purpose of the wrapper, but depending on your response I may submit a PR.
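A minimal sketch of the difference the question is about, under stated assumptions: `ToyCorridor` and `CountBonus` are hypothetical stand-ins (not MiniGrid code), and the bonus is assumed to be 1/sqrt(n). Recording s_t before stepping yields the same counts as updating the counter before the step. When the counter is keyed on the next state, two different (s_t, a_t) pairs that land in the same s_{t+1} share one counter, e.g. moving left into a cell and bumping into the wall from that cell.

```python
import math

class ToyCorridor:
    """Hypothetical 1-D corridor with cells 0..size-1; moving into a wall leaves the agent in place."""
    def __init__(self, start: int = 1, size: int = 3):
        self.pos = start
        self.size = size

    def step(self, action: int) -> int:
        # action is -1 (left) or +1 (right); clamp so the agent stays inside
        self.pos = min(max(self.pos + action, 0), self.size - 1)
        return self.pos

class CountBonus:
    """Hypothetical count-based bonus 1/sqrt(n) keyed on (s_{t+1}, a_t) or (s_t, a_t)."""
    def __init__(self, env: ToyCorridor, key_on_next_state: bool):
        self.env = env
        self.key_on_next_state = key_on_next_state
        self.counts = {}

    def step(self, action: int) -> float:
        s_t = self.env.pos              # state before acting
        s_next = self.env.step(action)  # state after acting
        key = (s_next if self.key_on_next_state else s_t, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        return 1 / math.sqrt(self.counts[key])

actions = [-1, -1]  # move left into cell 0, then bump into the left wall
next_keyed = CountBonus(ToyCorridor(), key_on_next_state=True)
state_keyed = CountBonus(ToyCorridor(), key_on_next_state=False)
print([round(next_keyed.step(a), 3) for a in actions])   # [1.0, 0.707]
print([round(state_keyed.step(a), 3) for a in actions])  # [1.0, 1.0]
```

With next-state keying, both steps map to the key (0, -1), so the second bonus is already discounted even though (0, -1) had never been taken from cell 0 before.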