danijar / dreamerv3

Mastering Diverse Domains through World Models
https://danijar.com/dreamerv3
MIT License

Question about (time) ordering of data / predictions for the continue predictor #65

Open PaulScemama opened 1 year ago

PaulScemama commented 1 year ago

Hi, I'm quite inexperienced with reinforcement learning, so forgive me if my question is trivial :) I have a quick question about the continue predictor.

In a typical Gym environment with an agent following a random policy, I've seen rollout loops like the following:

```python
import gymnasium as gym  # new-style Gym API: step() returns a 5-tuple

gym_env = gym.make("CartPole-v1")
num_episodes = 5

for _ in range(num_episodes):                                            # 1
    # First observation of an episode                                    # 2
    obs, info = gym_env.reset()                                          # 3
                                                                         # 4
    done = False                                                         # 5
    while not done:                                                      # 6
        action = gym_env.action_space.sample()                           # 7
        obs, reward, terminated, truncated, info = gym_env.step(action)  # 8
        done = terminated or truncated
```

The continue predictor is supposed to predict whether an episode will terminate or not. The way I see it, at each non-episode-initializing step (the numbered lines 7-8 in the snippet above) we get an observation $x_t$, a reward $r_t$, and a continue flag $c_t$ (0 when `done` is true, 1 otherwise).
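In code, I picture each such step producing one time-aligned record, roughly like the sketch below. The helper and field names here are just mine for illustration, not the repo's actual replay format:

```python
# Hypothetical helper, only to illustrate the time alignment I have in mind;
# these field names are not dreamerv3's actual replay format.
def make_step_record(obs, action, reward, done):
    """Bundle the results of one env.step() call into a single record."""
    return {
        "obs": obs,                    # x_t, returned by this step() call
        "action": action,              # the action that led to x_t
        "reward": reward,              # r_t
        "cont": 0.0 if done else 1.0,  # c_t: 1.0 while the episode continues
    }
```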

My question is: do we use $x_t$ to predict $c_t$? More specifically, does the stochastic posterior incorporate $x_t$ so that the "model state" (concatenation of deterministic state and stochastic state) is used to predict $c_t$?

Another way of asking the question: do we use the observation returned at the same step at which we receive the continue flag to predict that continue flag? I.e., in the line `obs, reward, terminated, truncated, info = gym_env.step(action)`, is that observation incorporated into the stochastic state, which then helps predict `done`?
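To make the question concrete, here is the ordering I have in mind as a toy sketch. The names `encoder`, `posterior`, and `continue_head` are stand-ins I made up, not functions from this repository:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in components, NOT the dreamerv3 API; they only mimic the flow of
# the computation I am asking about.
def encoder(x):
    return x.mean(keepdims=True)                 # embedding of observation x_t

def posterior(h, embed):
    return np.tanh(h + embed)                    # stochastic state z_t, sees x_t

def continue_head(state):
    return 1.0 / (1.0 + np.exp(-state.sum()))    # P(episode continues)

h_t = np.zeros(4)                                # deterministic recurrent state h_t
x_t = rng.normal(size=4)                         # observation returned by env.step()

z_t = posterior(h_t, encoder(x_t))               # posterior incorporates x_t
model_state = np.concatenate([h_t, z_t])         # "model state" = [h_t, z_t]
c_t_pred = continue_head(model_state)            # is THIS what predicts c_t?
print(c_t_pred)
```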

Thanks in advance!