astooke / rlpyt

Reinforcement Learning in PyTorch

AtariEnv, should repeat_action_probability be greater than 0 for sticky actions? #105

Closed DanielTakeshi closed 4 years ago

DanielTakeshi commented 4 years ago

Hi @astooke

I am working on debugging some issues related to randomness and determinism in the Atari environments. Here is where I think the randomness comes from for a vanilla DQN agent that runs an epsilon-greedy policy:

1. Sticky actions inside the ALE emulator (repeat_action_probability).
2. The random number of no-op actions at the start of each episode.
3. The agent's epsilon-greedy action sampling.
4. The agent's initial network parameters.

Do these four exhaustively cover all sources of randomness for an epsilon-greedy, DQN-based agent?

I quickly checked some of the values:

https://github.com/astooke/rlpyt/blob/d797dd8835a91a7b2902563a097c83ed7c8d3e92/rlpyt/envs/atari/atari_env.py#L67-L76

and it looks like the repeat action probability is 0, so we are not using sticky actions. I am wondering if there is a reason for not enabling this by default. I searched the repository but could not find any code that explicitly changes the repeat action probability. [I am also wondering if you set repeat action probability to a higher value for the benchmarks in the white paper.]
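For concreteness, here is roughly what I have in mind for turning them on (just a sketch, assuming the repeat_action_probability kwarg from the snippet above is accepted by the AtariEnv constructor and passed straight through to ALE):

```python
# Sketch only: enabling sticky actions on rlpyt's AtariEnv, assuming the
# repeat_action_probability kwarg shown in the linked snippet is forwarded
# to the ALE emulator setting.
from rlpyt.envs.atari.atari_env import AtariEnv

env = AtariEnv(
    game="pong",
    repeat_action_probability=0.25,  # value the sticky-actions paper recommends
)
```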

astooke commented 4 years ago

Hi, good questions! Yes, despite that paper urging everyone to start using sticky actions, basically none of the algorithm benchmarks we tried to reproduce use them (going all the way up to R2D2). But you're right, the repeat_action_probability kwarg is exactly sticky actions, and that paper suggested using 0.25.
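For anyone reading along, sticky actions just mean that with that probability the emulator repeats the previous action instead of the newly requested one; roughly like this (illustrative sketch only, the real logic lives inside ALE):

```python
import random

def sticky_action(requested_action, prev_action, repeat_prob=0.25, rng=random):
    # With probability repeat_prob, ignore the requested action and repeat
    # the previous one; this is the behavior ALE implements internally.
    if rng.random() < repeat_prob:
        return prev_action
    return requested_action
```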

As for the random seeds: I think the way it works is that each worker gets its own seed, and each worker might have several environment instances; the environment instances do not get their own random seeds. I think you're right that the places for randomness in the environment are the sticky actions (which happen inside the ALE code) and the random no-ops (which happen in the rlpyt code).

The randomness for the agent is all in the epsilon-greedy action sampling. If you use the GPU sampler, this happens inside the master process, according to its seed; if you use the CPU sampler, it happens inside each worker. Also, the agent's initial parameters are based on the master process's random seed.
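The sampling itself is the standard epsilon-greedy rule, something like this (illustrative sketch, not the exact agent code):

```python
import torch

def epsilon_greedy(q_values, epsilon):
    # q_values: tensor of shape [batch, num_actions]; returns one action per row.
    greedy = q_values.argmax(dim=-1)
    random_actions = torch.randint(q_values.shape[-1], greedy.shape)
    use_random = torch.rand(greedy.shape) < epsilon
    return torch.where(use_random, random_actions, greedy)
```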

Off the top of my head, I can't think of other randomness, since ALE is otherwise deterministic. Except that I think some convolution procedures might not be deterministic, while others are... I remember playing around with that in Theano (settings that get passed to cuDNN), but I haven't done it with PyTorch.
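For the record, the PyTorch-side knobs appear to be the cuDNN backend flags, though I haven't checked how much they matter here:

```python
import torch

# Ask cuDNN to pick deterministic convolution algorithms and disable the
# autotuner, which can otherwise select different kernels from run to run.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```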

DanielTakeshi commented 4 years ago

Thanks! OK, I guess we can close this since my questions are resolved. As long as we make it clear whether or not we're using sticky actions, people should know what we mean.