tejank10 closed this issue 6 years ago
Is there any paper that I can consult?

"neural network takes the difference between current state and the state one timestep before (s_t - s_{t-1})."

I would consider this a part of feature extraction, or treat it as an internal state of the policy. Here is my rough sketch:
memory_buffer = []
ϕ!(s) = ... # store state into memory_buffer and do feature extraction

ep = Episode(env, π)
for (s, a, r, s′) in ep
    ϕ!(s)
    ...
end
or
ep = Episode(env, π)
for (s, a, r, s′) in ep
    ...
    π.last_state = s
end
"the difference between current state and the state one timestep before."
I have a time-series application that needs a rolling time window for neural nets. My neural net needs s_t - s_{t-1}, s_t - s_{t-2}, ..., s_t - s_{t-n} as input. In this case, I did feature extraction first, created a larger table, and made this new table my environment. Thus, each state from the new environment is a complete time window.
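That preprocessing step can be sketched as follows. This is a minimal illustration, not code from any package; the function name and the assumption of scalar states are mine. It precomputes, for each timestep t, the full window of differences s_t - s_{t-k} for k = 1..n, producing the "larger table" whose rows become the new environment's states.

```julia
# Sketch: turn a state series into rows of lagged differences.
# All names here are illustrative, not from an actual package.
function lagged_differences(states::Vector{Float64}, n::Int)
    T = length(states)
    # Row t holds [s_t - s_{t-1}, s_t - s_{t-2}, ..., s_t - s_{t-n}];
    # only timesteps t > n have a complete window.
    return [states[t] - states[t - k] for t in (n + 1):T, k in 1:n]
end

lagged_differences([1.0, 2.0, 4.0, 7.0], 2)  # -> [2.0 3.0; 3.0 5.0]
```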
In the case that we cannot determine s_t - s_{t-1} ahead of time, I think my previous snippets are okay.
But I'm not sure about the design philosophy of the original paper; maybe we need to change the design of action.
Thanks for the reply!
I am not aware of a paper using s_t - s_{t-1}, but the code of 'Pong from Pixels' has it. I wanted to implement such a design in my implementation.
ep = Episode(env, π)
for (s, a, r, s′) in ep
    ...
    π.last_state = s
end
In this case, isn't the episode already over, so that only s_t was taken into account to predict the action?
"In this case isn't the episode already over, where only s_t was taken into account to predict action?"
Well, inside the for loop the episode isn't over yet. Episode implements the iteration protocol, so you can iterate it manually to check what's going on via start(...); next(...); next(...); etc.
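To make the manual-iteration point concrete, here is a sketch using the `iterate` API (the `start`/`next` calls mentioned above are from the pre-1.0 protocol; since Julia 1.0 a single `iterate` function replaces them). A plain vector of tuples stands in for an Episode here, since any iterable can be stepped the same way:

```julia
# Stand-in for the (s, a, r, s′) tuples an Episode would yield.
xs = [(1, :a), (2, :b)]

# Step the iterable by hand: iterate returns (element, state) or nothing.
let step = iterate(xs)
    while step !== nothing
        (item, st) = step
        println(item)        # inspect what the for loop would see
        step = iterate(xs, st)
    end
end
```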
I think my snippet can be refined as follows:
function action(π::MyPolicy, r, s)
    a = π(s .- π.last_s) # do action selection stuff
    π.last_s = s
    return a
end

ep = Episode(env, π)
for (s, a, r, s′) in ep
    # ...
end
# episode end
When using epsilon-greedy methods to take an action, the neural network predicts which action to take based on the input state. Recently, there have been developments wherein, instead of one state (i.e., the current state), the neural network takes the difference between the current state and the state one timestep before (s_t - s_{t-1}). Or it may accept a set of states as input and predict an action.
I am guessing that if such an action function has to be implemented, we need to modify the call to action. I am interested in developing this functionality. Can anyone point me in the right direction?
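For the "set of states as input" variant mentioned above, here is a hedged sketch in the spirit of the refined snippet earlier in the thread. `WindowPolicy`, its fields, and the stacking scheme are all my own illustrative assumptions, not an existing API: the policy keeps a rolling buffer of the last n states and feeds their concatenation to the network inside `action`.

```julia
# Sketch: a policy that keeps a rolling buffer of the last n states
# and feeds their concatenation to the network. All names are illustrative.
mutable struct WindowPolicy
    net                               # the underlying network / function
    buffer::Vector{Vector{Float64}}   # most recent states, oldest first
    n::Int                            # window length
end

function action(π::WindowPolicy, r, s::Vector{Float64})
    push!(π.buffer, s)
    length(π.buffer) > π.n && popfirst!(π.buffer)  # keep only the last n states
    x = reduce(vcat, π.buffer)                     # stacked input for the net
    return π.net(x)
end
```

With `sum` standing in for a real network, `action(WindowPolicy(sum, Vector{Float64}[], 2), 0.0, [1.0])` returns the "network" output for a one-state window, and later calls see at most the two most recent states.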