Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
https://gymnasium.farama.org
MIT License

[Bug Report] AtariPreprocessing Wrapper terminal_on_life_loss problem #166

Closed sdpkjc closed 1 year ago

sdpkjc commented 1 year ago

Describe the bug

When using the AtariPreprocessing wrapper with terminal_on_life_loss set to True, the full Atari game can never be played: instead of continuing with the next life after a life is lost, the game is reset completely.

I observe that the AtariPreprocessing wrapper stores the number of lives in the environment but never uses it. We may need to follow the EpisodicLifeEnv wrapper in stable-baselines3 and modify the reset function of AtariPreprocessing accordingly.

Code example

import gymnasium as gym

def make_env():
    def thunk() -> gym.Env:
        env = gym.make("ALE/Breakout-v5")
        # frame_skip=1 because ALE/Breakout-v5 already applies frame skipping internally
        env = gym.wrappers.AtariPreprocessing(
            env,
            frame_skip=1,
            terminal_on_life_loss=True,
        )
        return env

    return thunk

envs = gym.vector.SyncVectorEnv([make_env()])
obs, _ = envs.reset()

while True:
    act = envs.action_space.sample()
    next_obs, reward, terminated, truncated, infos = envs.step(act)
    obs = next_obs
    # SyncVectorEnv auto-resets finished sub-envs; "final_info" carries the
    # info dict of the episode that just ended.
    if "final_info" in infos:
        print(infos["final_info"])

Output:

[{'lives': 4, 'episode_frame_number': 144, 'frame_number': 144}]
[{'lives': 4, 'episode_frame_number': 256, 'frame_number': 400}]
[{'lives': 4, 'episode_frame_number': 180, 'frame_number': 580}]
[{'lives': 4, 'episode_frame_number': 104, 'frame_number': 684}]
[{'lives': 4, 'episode_frame_number': 124, 'frame_number': 808}]
[{'lives': 4, 'episode_frame_number': 200, 'frame_number': 1008}]
[{'lives': 4, 'episode_frame_number': 344, 'frame_number': 1352}]
[{'lives': 4, 'episode_frame_number': 172, 'frame_number': 1524}]
# ...


pseudo-rnd-thoughts commented 1 year ago

Could you explain the issue in more detail? To me, the point of this parameter is that if the agent loses a life, the episode ends. Therefore, the output you show is what I would expect with this parameter enabled.

It is noted in the AtariPreprocessing wrapper that it follows Machado et al. (2018), "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents", not the original DQN parameters.

sdpkjc commented 1 year ago

The expected output:

[{'lives': 4, 'episode_frame_number': 152, 'frame_number': 152}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 3, 'episode_frame_number': 268, 'frame_number': 268}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 2, 'episode_frame_number': 364, 'frame_number': 364}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 1, 'episode_frame_number': 580, 'frame_number': 580}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 0, 'episode_frame_number': 689, 'frame_number': 689}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 4, 'episode_frame_number': 224, 'frame_number': 913}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 3, 'episode_frame_number': 332, 'frame_number': 1021}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 2, 'episode_frame_number': 460, 'frame_number': 1149}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 1, 'episode_frame_number': 956, 'frame_number': 1645}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 0, 'episode_frame_number': 1252, 'frame_number': 1941}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 4, 'episode_frame_number': 184, 'frame_number': 2125}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 3, 'episode_frame_number': 292, 'frame_number': 2233}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 2, 'episode_frame_number': 388, 'frame_number': 2329}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 1, 'episode_frame_number': 484, 'frame_number': 2425}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 0, 'episode_frame_number': 589, 'frame_number': 2530}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 4, 'episode_frame_number': 156, 'frame_number': 2686}]
# ...

sdpkjc commented 1 year ago

This parameter is not recommended by Machado et al. (2018); it is a legacy option, first used in the Nature DQN paper. From the literature, my understanding is that it means returning a termination signal when a life is lost, not actually resetting the whole environment at that point. Both CleanRL and SB3 implement it this way.

You can also see that the AtariPreprocessing wrapper contains the self.lives and self.game_over attributes, but neither is used. This suggests the feature was intended to be implemented but was never completed, which could be a bug.
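
A quick way to see this (a hypothetical check, assuming the attributes are reachable directly on the wrapper instance):

import gymnasium as gym

env = gym.wrappers.AtariPreprocessing(
    gym.make("ALE/Breakout-v5"), frame_skip=1, terminal_on_life_loss=True
)
env.reset()
# Both are set in the wrapper's __init__, but (as described above) they are
# never updated or consulted during step()/reset().
print(env.lives, env.game_over)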

sdpkjc commented 1 year ago

atari_preprocessing.py L24-25:

    - Termination signal when a life is lost: When the agent losses a life during the environment, then the environment is terminated.
        Turned off by default. Not recommended by Machado et al. (2018).

Here, I think the first half of the sentence ("Termination signal when a life is lost") is accurate, but the second half may be somewhat misleading.

pseudo-rnd-thoughts commented 1 year ago

From the docstring description and my understanding of the parameter, this is working as expected. Could you link to the stable-baselines3 and CleanRL implementations and provide a script showing the difference?

sdpkjc commented 1 year ago

The relevant code is in stable-baselines3's atari_wrappers module (the EpisodicLifeEnv wrapper); CleanRL uses the wrappers from SB3.

sdpkjc commented 1 year ago

[{'lives': 4, 'episode_frame_number': 144, 'frame_number': 144}]
[{'lives': 4, 'episode_frame_number': 256, 'frame_number': 400}]
[{'lives': 4, 'episode_frame_number': 180, 'frame_number': 580}]
[{'lives': 4, 'episode_frame_number': 104, 'frame_number': 684}]
[{'lives': 4, 'episode_frame_number': 124, 'frame_number': 808}]
[{'lives': 4, 'episode_frame_number': 200, 'frame_number': 1008}]
[{'lives': 4, 'episode_frame_number': 344, 'frame_number': 1352}]
[{'lives': 4, 'episode_frame_number': 172, 'frame_number': 1524}]
# ...

Do you think this is as expected?

sdpkjc commented 1 year ago

My understanding is that with this option enabled, the game should return a termination signal at the end of each life and then continue with the next life. The current code only plays a single life: when that life ends, the game starts over. If my understanding is wrong, we could rename the option to single_life and remove the self.lives and self.game_over attributes, which might make it easier to understand.

pseudo-rnd-thoughts commented 1 year ago

Looking at the code in stable-baselines3, this does what I expect: https://github.com/DLR-RM/stable-baselines3/blob/e3b24829a57ece38f78436cfea38aaa62d059850/stable_baselines3/common/atari_wrappers.py#L87

The wrapper doesn't reset itself; it just returns terminated=True, and the user is expected to reset the environment as a result. Otherwise, how would you tell the difference between a lost life and the end of the game?

I haven't looked at the code closely, but I agree that the current code is poorly structured and could be improved. However, the implementation is currently correct.
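
A minimal sketch of the usage this implies (illustrative only, using a single-env version of the reproduction script above):

env = make_env()()  # the thunk from the reproduction script above
obs, info = env.reset()
while True:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        # With terminal_on_life_loss=True this fires on every life loss; the user
        # then calls reset(), which in the current wrapper restarts the whole game.
        obs, info = env.reset()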

sdpkjc commented 1 year ago

The user can look at self.game_over to tell the difference between the end of a life and the end of the game.

I still think there is a problem with the current output. I hope more users will join the discussion so this can be resolved in the future. 🫡

pseudo-rnd-thoughts commented 1 year ago

@sdpkjc Looking back at this issue and thinking about it as a POMDP (partially observable Markov decision process), the user should only care about when an episode ends; whether that is the end of a life or the end of the game doesn't matter to the user. Looking at the stable-baselines3 code, I think our implementation matches it.

Therefore, I'm going to close this issue and the related PR.

Also, the Atari preprocessing wrapper is going to be updated to remove the self.game_over attribute, as users shouldn't rely on it.

typoverflow commented 1 year ago

Hi @pseudo-rnd-thoughts, I think the real issue @sdpkjc is concerned about is reset(). In the SB3 implementation there is a flag, self.was_real_done, which records whether the done signal came from a life loss or a real game over, and the underlying game is reset only when all lives are lost. But in the current AtariPreprocessing implementation, the wrapped env's reset() is called whenever the user calls reset(). Suppose we train a DQN agent with terminal_on_life_loss=True and evaluate it with terminal_on_life_loss=False: there will then be a mismatch between the training and evaluation settings, because during training the agent never encounters the observations that come after the first life loss. I'm still debugging this in my Rainbow implementation to see the actual effect.
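
For reference, a rough sketch of that SB3-style behaviour (paraphrased from memory, not the exact stable-baselines3 code; it assumes the lives count is reachable via env.unwrapped.ale.lives()):

import gymnasium as gym

class EpisodicLifeWrapper(gym.Wrapper):
    """Terminate on life loss, but only reset the real game when all lives are gone."""

    def __init__(self, env):
        super().__init__(env)
        self.lives = 0
        self.was_real_done = True

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.was_real_done = terminated or truncated
        lives = self.env.unwrapped.ale.lives()
        if 0 < lives < self.lives:
            # A life was lost: report termination, but remember the game isn't really over.
            terminated = True
        self.lives = lives
        return obs, reward, terminated, truncated, info

    def reset(self, **kwargs):
        if self.was_real_done:
            obs, info = self.env.reset(**kwargs)
        else:
            # The game isn't over yet: take a no-op step to continue from the current state.
            obs, _, terminated, truncated, info = self.env.step(0)
            if terminated or truncated:
                obs, info = self.env.reset(**kwargs)
        self.lives = self.env.unwrapped.ale.lives()
        return obs, info

With something along these lines, terminal_on_life_loss would behave the way described above: terminated=True at each life loss, but the game only truly restarts once all lives are used.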