Closed sdpkjc closed 1 year ago
Could you explain the issue in more detail? To me, the point of the parameter is that if the agent loses a life in the environment then the game ends. Therefore, the output that you show is what is expected with this parameter enabled.
It is noted in the AtariPreprocessing wrapper that it follows Machado et al. (2018), "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents", not the original DQN parameters.
The expected output:
[{'lives': 4, 'episode_frame_number': 152, 'frame_number': 152}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 3, 'episode_frame_number': 268, 'frame_number': 268}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 2, 'episode_frame_number': 364, 'frame_number': 364}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 1, 'episode_frame_number': 580, 'frame_number': 580}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 0, 'episode_frame_number': 689, 'frame_number': 689}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 4, 'episode_frame_number': 224, 'frame_number': 913}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 3, 'episode_frame_number': 332, 'frame_number': 1021}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 2, 'episode_frame_number': 460, 'frame_number': 1149}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 1, 'episode_frame_number': 956, 'frame_number': 1645}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 0, 'episode_frame_number': 1252, 'frame_number': 1941}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 4, 'episode_frame_number': 184, 'frame_number': 2125}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 3, 'episode_frame_number': 292, 'frame_number': 2233}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 2, 'episode_frame_number': 388, 'frame_number': 2329}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 1, 'episode_frame_number': 484, 'frame_number': 2425}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 0, 'episode_frame_number': 589, 'frame_number': 2530}]
dict_keys(['lives', '_lives', 'episode_frame_number', '_episode_frame_number', 'frame_number', '_frame_number', 'final_observation', '_final_observation', 'final_info', '_final_info'])
[{'lives': 4, 'episode_frame_number': 156, 'frame_number': 2686}]
# ...
This parameter is not recommended in Machado et al. (2018); it is a legacy parameter, first used in the Nature DQN paper. I looked at the literature and understood it to mean returning a termination signal at the end of each life, not actually resetting the environment completely when a life is lost. Both CleanRL and SB3 implement it this way as well.
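To make that SB3/CleanRL-style behaviour concrete, here is a minimal sketch of the episodic-life logic. `FakeAtariEnv` is a toy stand-in for a real ALE environment (it is not part of any of these libraries, and the step/life numbers are invented); only the lives bookkeeping matters:

```python
class FakeAtariEnv:
    """Toy stand-in for an Atari env: loses one life every 3 steps,
    game over when all 5 lives are gone."""
    def __init__(self):
        self.lives = 0
        self.t = 0

    def reset(self):
        self.lives = 5
        self.t = 0
        return 0, {"lives": self.lives}

    def step(self, action):
        self.t += 1
        if self.t % 3 == 0:
            self.lives -= 1
        terminated = self.lives == 0  # real game over
        return 0, 0.0, terminated, False, {"lives": self.lives}


class EpisodicLifeWrapper:
    """Sketch of the episodic-life idea: report terminated=True on every
    life loss, but only reset the underlying game when it is really over."""
    def __init__(self, env):
        self.env = env
        self.lives = 0
        self.game_over = True

    def reset(self):
        if self.game_over:
            obs, info = self.env.reset()       # real reset: new game
        else:
            # soft reset: take a no-op step and keep playing the same game
            obs, _, _, _, info = self.env.step(0)
        self.lives = info["lives"]
        self.game_over = False
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.game_over = terminated
        if 0 < info["lives"] < self.lives:     # a life was lost mid-game
            terminated = True
        self.lives = info["lives"]
        return obs, reward, terminated, truncated, info
```

Stepping the wrapped toy env and collecting `info["lives"]` at each termination yields the counting-down pattern shown in the expected output above (4, 3, 2, 1, 0, then 4 again after a real reset).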
You can also see that the AtariPreprocessing wrapper contains the self.lives and self.game_over attributes, but neither is used. This suggests that this feature was intended to be implemented but somehow was never fully completed, which could be a bug.
atari_preprocessing.py L24-25:
- Termination signal when a life is lost: When the agent losses a life during the environment, then the environment is terminated.
Turned off by default. Not recommended by Machado et al. (2018).
Here, I think the first half of the sentence ("Termination signal when a life is lost") is accurate, and the second half may be somewhat misleading.
From the docstring description and my understanding of the parameter, this is working as expected. Could you link to the stable-baselines3 and CleanRL implementations and provide a script showing the difference?
stable-baselines3 atari_wrappers (CleanRL uses the wrappers from SB3).
[{'lives': 4, 'episode_frame_number': 144, 'frame_number': 144}]
[{'lives': 4, 'episode_frame_number': 256, 'frame_number': 400}]
[{'lives': 4, 'episode_frame_number': 180, 'frame_number': 580}]
[{'lives': 4, 'episode_frame_number': 104, 'frame_number': 684}]
[{'lives': 4, 'episode_frame_number': 124, 'frame_number': 808}]
[{'lives': 4, 'episode_frame_number': 200, 'frame_number': 1008}]
[{'lives': 4, 'episode_frame_number': 344, 'frame_number': 1352}]
[{'lives': 4, 'episode_frame_number': 172, 'frame_number': 1524}]
# ...
Do you think this is as expected?
My understanding is that with this option enabled, the game should return the termination signal at the end of each life and then continue on to the next life. The current code only plays a single life: when that life ends, the whole game starts over.
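The two behaviours can be illustrated with a toy game (a stand-in for the ALE, purely for illustration, not the real wrapper code). With a full reset on every reset(), the lives count reported at each termination is always the starting value minus one; with continue-to-next-life semantics, it counts down to zero before the game really restarts, matching the two outputs pasted earlier in this thread:

```python
class ToyGame:
    """Toy stand-in for an Atari game: a life lasts 3 steps, 5 lives total."""
    LIVES = 5
    STEPS_PER_LIFE = 3

    def reset(self):
        self.lives = self.LIVES
        self.steps = 0
        return {"lives": self.lives}

    def step(self):
        self.steps += 1
        life_lost = self.steps % self.STEPS_PER_LIFE == 0
        if life_lost:
            self.lives -= 1
        # terminal_on_life_loss: every life loss ends the episode
        return life_lost, {"lives": self.lives}


def lives_at_termination(continue_on_reset, episodes=6):
    """Collect info['lives'] at each episode end under one reset policy."""
    game = ToyGame()
    game.reset()
    seen = []
    while len(seen) < episodes:
        life_lost, info = game.step()
        if life_lost:
            seen.append(info["lives"])
            # full-reset policy restarts the game after every episode;
            # the "continue" policy only restarts when all lives are gone
            if not continue_on_reset or info["lives"] == 0:
                game.reset()
    return seen

print(lives_at_termination(False))  # full reset:  [4, 4, 4, 4, 4, 4]
print(lives_at_termination(True))   # continue:    [4, 3, 2, 1, 0, 4]
```

The first sequence mirrors the actual output above (lives stuck at 4 because every reset restarts the game); the second mirrors the expected output (lives counting down across episodes).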
If what I understand is wrong, we could rename the option to single_life and remove the self.lives and self.game_over attributes, which might make it easier to understand.
Looking at the code in stable-baselines3, this does what I expect: https://github.com/DLR-RM/stable-baselines3/blob/e3b24829a57ece38f78436cfea38aaa62d059850/stable_baselines3/common/atari_wrappers.py#L87
The wrapper doesn't reset itself; it just returns terminated=True, and the user is expected to reset the environment as a result. Otherwise, how do you tell the difference between a life lost and the end of a game?
I haven't looked at the code closely, but I agree that the current code is badly placed and could be improved. However, the implementation is currently correct.
The user looks at self.game_over to tell the difference between the end of a life and the end of the game.
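As an alternative to reading self.game_over, a training loop can make the same distinction from the lives count. This is a sketch assuming the info dict carries a "lives" entry, as in the outputs pasted above; is_real_game_over is a hypothetical helper, not a Gymnasium API:

```python
def is_real_game_over(terminated, info):
    # With terminal_on_life_loss=True, terminated fires on every life loss.
    # A true game over is a termination with no lives left; any other
    # termination is just an episodic life boundary.
    return terminated and info.get("lives", 0) == 0
```

A loop would then reset the underlying game only when is_real_game_over(...) is true, and treat other terminations as episode boundaries for learning purposes.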
I still think there is some problem with the current output. I hope that more users will participate in the discussion to solve this problem in the future. 🫡
@sdpkjc Looking back at this issue and thinking about POMDPs (partially observable Markov decision processes), the user should just care about when an episode ends (end of life vs. end of game doesn't matter to the user). Looking at the stable-baselines3 code, I think our code matches its implementation.
Therefore, I'm going to close this issue and the related PR.
Also, the AtariPreprocessing wrapper is going to be updated to remove the self.game_over attribute, as users shouldn't use it.
Hi @pseudo-rnd-thoughts,
I think the real issue @sdpkjc is concerned about is reset(). In the SB3 implementation there is a flag, self.was_real_done, which marks whether done was due to a life loss or a game over, and the game is reset only when all lives are lost. But in the current AtariPreprocessing implementation, the wrapped env's reset is called whenever the user calls reset(). Suppose we train a DQN agent with term_on_life_loss=True and evaluate it with term_on_life_loss=False; then there will be a mismatch between the training and evaluation settings: during training, the agent never encounters the observations that come after the first life loss. I'm still debugging this problem in my Rainbow implementation to see the actual effect.
Describe the bug
When using the AtariPreprocessing wrapper with terminal_on_life_loss set to true, the entire Atari game cannot be run: instead of moving on to the next life, the game resets completely when a life runs out. I observe that the AtariPreprocessing wrapper stores information about the number of lives in the environment but does not use it. We may need to refer to the EpisodicLifeEnv wrapper in stable-baselines3 to modify the reset function of the AtariPreprocessing wrapper.
Code example