Closed DanielTakeshi closed 4 years ago
Hi, good questions! Yes, despite that paper urging everyone to start using sticky actions, basically none of the algorithm benchmarks we tried to reproduce use them (going all the way up to R2D2). But you're right, the `repeat_action_probability` kwarg is exactly sticky actions, and that paper suggested using 0.25.
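For anyone landing here later, the sticky-actions mechanic itself is simple enough to sketch in pure Python. This is a simplified model of what ALE does internally with `repeat_action_probability`, not rlpyt or ALE code; the function name is made up for illustration:

```python
import random

def sticky_step(intended_action, previous_action, rng, repeat_action_probability=0.25):
    """With probability `repeat_action_probability`, ignore the agent's
    intended action and repeat the previously executed one (sticky actions)."""
    if rng.random() < repeat_action_probability:
        return previous_action
    return intended_action

# Over an episode, the executed action stream occasionally "sticks":
rng = random.Random(0)
prev = 0
executed = []
for intended in [1, 2, 3, 4, 5]:
    prev = sticky_step(intended, prev, rng, repeat_action_probability=0.5)
    executed.append(prev)
```

The point of the 0.25 recommendation in that paper is to break the determinism that agents can otherwise exploit by memorizing fixed action sequences.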
As for the random seeds. I think the way it works is that each worker gets its own seed, where each worker might have several environment instances. The environment instances do not get their own random seeds. I think you're right the places for randomness in the environment are the sticky actions (which happens inside the ALE code) and the random noops (which happens in the rlpyt code).
And the randomness for the agent is all from sampling for epsilon-greedy. If you use the GPU sampler, this happens inside the master process, according to its seed; if you use the CPU sampler, it happens inside each worker. Also, the agent's initial parameters are based on the master process's random seed.
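The epsilon-greedy draw itself is the standard rule; a self-contained sketch (not the rlpyt agent code, which operates on torch tensors):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon, pick a uniformly random action;
    otherwise pick the argmax of the Q-values (greedy)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(1)
q = [0.1, 0.9, 0.3]
action = epsilon_greedy(q, 0.1, rng)  # usually 1 (the argmax), sometimes random
```

Whether `rng` here corresponds to the master process or a worker is exactly the GPU-sampler vs. CPU-sampler distinction above.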
Off the top of my head, I can't think of other randomness, since ALE is otherwise deterministic. Except, I think some convolution procedures might not be deterministic while others are... I remember playing around with that in Theano (settings which are passed to cuDNN), but I haven't done it with PyTorch.
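For PyTorch specifically, the cuDNN knobs live under `torch.backends.cudnn`; a sketch of the usual reproducibility settings (a config fragment, and note that forcing deterministic algorithms can cost some speed):

```python
import torch

# Seed the master process RNG (agent parameter initialization, and the
# GPU sampler's epsilon-greedy draws, per the discussion above).
torch.manual_seed(0)

# Ask cuDNN to use deterministic convolution algorithms, and disable the
# benchmark autotuner, which can otherwise select different kernels
# (some of them nondeterministic) from run to run.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```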
Thanks! OK, I guess we can close this since my questions are resolved. As long as we make it clear whether or not we're using sticky actions, people should know what we mean.
Hi @astooke
I am working on debugging some issues related to randomness and determinism in the Atari environments. Here is where I think the randomness comes from for a vanilla DQN agent that runs an epsilon-greedy policy:
First, there is a random seed for each environment instance, created upon initialization. If we use 10 parallel environments, each gets a different random seed. [I think the only effect of this seed will be to impact the randomness in `max_start_noops` and `repeat_action_probability`, described below. Is that right? Or is the randomness in `repeat_action_probability` seeded separately, somehow?]

Second, there is the standard `max_start_noops` of N, where the agent takes no-op actions for between 0 and N time steps; usually N=30.

Third, there is a `repeat_action_probability`, which leads to sticky actions. This is 0 by default, but the paper "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents" recommends using sticky actions, and I think their choice of probability is 0.25.

Fourth, when the agent takes steps in the environment, it has an epsilon parameter for its epsilon-greedy policy, which starts at 1.0 and decays to a final value such as 0.1, 0.01, or 0.001, depending on what we choose.
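To keep my own understanding straight, here is a toy loop marking where each of the four sources enters (a sketch with a dummy greedy action, not rlpyt's sampler code; all names are made up for illustration):

```python
import random

def run_episode(seed, n_actions=4, max_start_noops=30,
                repeat_action_probability=0.25, epsilon=0.1, horizon=10):
    """Toy episode showing where each of the four randomness sources enters."""
    rng = random.Random(seed)                      # (1) per-environment seed
    noops = rng.randint(0, max_start_noops)        # (2) random start no-ops
    trajectory = [0] * noops                       # action 0 = no-op
    prev_action = 0
    for _ in range(horizon):
        if rng.random() < epsilon:                 # (4) epsilon-greedy draw
            intended = rng.randrange(n_actions)
        else:
            intended = 1  # stand-in for the greedy (argmax-Q) action
        if rng.random() < repeat_action_probability:   # (3) sticky actions
            action = prev_action
        else:
            action = intended
        trajectory.append(action)
        prev_action = action
    return trajectory

traj = run_episode(seed=0)  # same seed => same trajectory
```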
Do these four exhaustively cover all sources of randomness for an epsilon-greedy DQN-based agent?
I quickly checked some of the values:
https://github.com/astooke/rlpyt/blob/d797dd8835a91a7b2902563a097c83ed7c8d3e92/rlpyt/envs/atari/atari_env.py#L67-L76
and it looks like the repeat action probability is 0 by default, so we are not using sticky actions. I am wondering if there is a reason for not enabling this by default. I searched the repository but could not find any code that explicitly changes the repeat action probability. [I am also wondering if you set the repeat action probability to a higher value for the benchmarks in the white paper.]