Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

Handling of environment resets #249

Closed PaulScemama closed 6 months ago

PaulScemama commented 6 months ago

Hi!

I had a question about how you handle environment resets.

I see that you only call an environment's reset method once. For example, in dreamer_v3.py, here.

In some environments, like gymnasium ones, the canonical usage is

from tqdm import tqdm

# `env`, `agent`, and `n_episodes` are assumed to be defined elsewhere
for episode in tqdm(range(n_episodes)):
    obs, info = env.reset()
    done = False

    # play one episode
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # update the agent
        agent.update(obs, action, reward, terminated, next_obs)

        # check whether the episode is over and advance the current observation
        done = terminated or truncated
        obs = next_obs

Here, any time done is True, a fresh observation is obtained from reset() and fed to the agent so it can choose the first action of the new episode.

My question is: how do you obtain this first observation from reset() whenever a new episode begins? I can't seem to find where this ever happens in your code. My hunch is that you instead use the last observation (the one returned alongside done being True) as the first observation of the next episode, effectively in place of the one obtained via reset() as in the snippet above.

Thanks so much, and by the way very nice library!

belerico commented 6 months ago

Hi @PaulScemama, thank you for using our library! In every algorithm we always wrap our environments with a gymnasium.vector.SyncVectorEnv or gymnasium.vector.AsyncVectorEnv. As specified in the gymnasium docs:

To prevent terminated environments from waiting until all sub-environments have terminated or truncated, the vector environments autoreset sub-environments after they terminate or truncate. As a result, the final step's observation and info are overwritten by the reset's observation and info. Therefore, the observation and info for the final step of a sub-environment are stored in the info parameter, using the "final_observation" and "final_info" keys respectively.

So we're always sure that we have the reset observation when an episode has terminated (or truncated). When we need the final observations, we grab them from the info dictionary.
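
For reference, here is a minimal, self-contained sketch (not code from sheeprl) of that autoreset behavior, assuming a gymnasium version where the quoted "final_observation" mechanism applies (pre-1.0):

import gymnasium as gym
import numpy as np

# Two CartPole environments stepped in lockstep; SyncVectorEnv autoresets
# each sub-environment as soon as its episode ends.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(2)])

obs, info = envs.reset(seed=0)
for _ in range(500):
    actions = envs.action_space.sample()
    obs, rewards, terminated, truncated, info = envs.step(actions)
    dones = np.logical_or(terminated, truncated)
    if dones.any():
        # For the finished sub-environments, `obs` already holds the reset
        # observation of the next episode; the true last observation of the
        # finished episode lives in info["final_observation"] (an object
        # array with None entries for sub-environments that did not finish).
        for i in np.flatnonzero(dones):
            final_obs = info["final_observation"][i]
            print(f"env {i} ended; final obs shape: {final_obs.shape}")
        break
envs.close()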

PaulScemama commented 6 months ago

Thank you @belerico! That totally makes sense.