Hi, I'm quite inexperienced with Reinforcement Learning, so forgive me if my question is trivial :). I have a quick question about the continue predictor.
In a typical Gym environment with an agent following a random policy, I've seen things like:

```python
import gymnasium as gym  # assuming the Gymnasium API, which the 5-tuple step() return implies

num_episodes = 10                  # arbitrary, just for illustration
gym_env = gym.make("CartPole-v1")  # any environment would do here

for _ in range(num_episodes):                                   # 1
    # First observation of an episode                           # 2
    obs, info = gym_env.reset()                                 # 3
                                                                # 4
    done = False                                                # 5
    while not done:                                             # 6
        action = gym_env.action_space.sample()                  # 7
        # (this treats `terminated` as `done` and discards `truncated`)
        observation, reward, done, _, _ = gym_env.step(action)  # 8
```
The continue predictor is supposed to predict whether an episode will terminate. As I see it, each step that does not initialize an episode (lines 7-8 above) yields the following (see the sketch after this list):

- an action, $a_t$
- a reward resulting from the action, $r_t$
- a "next" observation resulting from the action, $x_t$
- a "done" (or, equivalently, continue) flag indicating whether the episode has terminated, $c_t$
My question is: do we use $x_t$ to predict $c_t$? More specifically, does the stochastic posterior incorporate $x_t$ so that the "model state" (concatenation of deterministic state and stochastic state) is used to predict $c_t$?
Another way of asking the question: do we use the observation retrieved at the same step at which we receive the continue flag to help predict that flag? I.e., in the line `observation, reward, done, _, _ = gym_env.step(action)`, do we incorporate `observation` into the stochastic state, which then helps predict `done`?
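To pin down the notation, here is my reading of the world-model components involved (this is my own summary, writing $h_t$ for the deterministic state and $z_t$ for the stochastic state; please correct me if I have the conditioning wrong):

$$
\begin{aligned}
\text{sequence model:}\quad & h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{encoder (posterior):}\quad & z_t \sim q_\phi(z_t \mid h_t, x_t) \\
\text{continue predictor:}\quad & \hat{c}_t \sim p_\phi(\hat{c}_t \mid h_t, z_t)
\end{aligned}
$$

Under that reading, the posterior conditions on $x_t$ and the continue predictor conditions on the model state $(h_t, z_t)$, which is why I suspect the answer is yes; I'd just like to confirm I'm not misreading it.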
Thanks in advance!