dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

Actor Critic for Atari games #16

Closed IbrahimSobh closed 7 years ago

IbrahimSobh commented 7 years ago

Dear Danny

Thank you for the great work! I have two questions:

1- Is it possible to change the “CliffWalk Actor Critic Solution.ipynb” code to implement Actor-Critic for Gym Atari games?

I believe Actor-Critic is an on-policy algorithm where value-based and policy-based methods are used together.

Using experience replay is important to de-correlate samples for non-linear approximators. On the other hand, experience replay requires off-policy learning algorithms that can update from data generated by an older policy. (Mnih 2016)

I was thinking that it should be possible to do the following (a rough sketch of the frame stacking is below):

• Change the “Policy Estimator” state to be 4 stacked observations (similar to the DQN code)
• Change the “Value Estimator” state to be 4 stacked observations (similar to the DQN code)
• For both the “Policy Estimator” and the “Value Estimator”, use a non-linear function approximator, a convolutional network similar to the DQN code
• Then use experience replay and update with batches

However, I am not sure about the results
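
A rough sketch of the frame-stacking part I have in mind (only an illustration: it assumes the old Gym API where step returns a 4-tuple, a Breakout-v0 environment, and a crude preprocess helper standing in for the repo's DQN state processor):

```python
# Rough sketch of 4-frame state stacking (illustrative; assumes old Gym API).
from collections import deque

import gym
import numpy as np


def preprocess(frame):
    # Crude stand-in for the DQN state processor: grayscale + 2x downsample.
    return frame.mean(axis=2)[::2, ::2].astype(np.uint8)


env = gym.make("Breakout-v0")            # any Gym Atari environment
frames = deque(maxlen=4)                 # the 4 most recent processed frames

obs = env.reset()
for _ in range(4):                       # fill the stack with the first frame
    frames.append(preprocess(obs))
state = np.stack(frames, axis=2)         # input to both policy and value estimators

obs, reward, done, _ = env.step(env.action_space.sample())
frames.append(preprocess(obs))
next_state = np.stack(frames, axis=2)    # next 4-frame state after one step
```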

2- How does A3C work? I think:

• Multiple agents are used, and each agent has its own environment.
• No experience replay is used.
• Then what?
• How do samples become de-correlated? (Because they come from different agents?)
• But each agent has its own model and has to “update” its own model parameters before sending them to the higher level, yet each agent produces correlated transitions in its own environment. (I am confused)

Thank you

dennybritz commented 7 years ago
  1. Your understanding seems right. But standard actor-critic methods are always on-policy, so you can't just use experience replay with them (that's one of the motivations for A3C). I think there is research on off-policy actor critic methods, but I'm not too familiar with that.
  2. Yes, samples are decorrelated because each agent generates them independently in its own environment. Each agent doesn't update its own model - it only updates the "shared" model that all agents use, asynchronously.
IbrahimSobh commented 7 years ago

thank you

IbrahimSobh commented 7 years ago

Thank you

"Each agent doesn't update its own model - it only updates the "shared" model that all agents use, asynchronously"

I just want to make sure that:

Suppose that agent_i takes action a_i in environment e_i and makes a transition t_i. Then the shared model will be updated according to this transition, and all other agents do the same asynchronously.

There is nothing special about each agent except that it has its own environment. We have only one model, which is the shared model.

There are no local parameters per agent and there is no model per agent. Or does each agent have a copy of the shared model that is updated periodically?

The shared A3C model is actually a normal Actor Critic Model, except that it takes transitions from different agents in different environments.

In other words, if I have 5 agents in 5 environments, and in each round I access one of the agents (round robin) using your code "CliffWalk Actor Critic Solution.ipynb", would it be almost A3C?

Finally, could you please give a hint on whether and how it is possible to change the “CliffWalk Actor Critic Solution.ipynb” code to implement Actor-Critic for Gym Atari games?

fferreres commented 7 years ago

You may want to check this article: https://blog.acolyer.org/2016/10/10/asynchronous-methods-for-deep-reinforcement-learning/

"Because the parallel approach no longer relies on experience replay, it becomes possible to use ‘on-policy’ reinforcement learning methods such as Sarsa and actor-critic. The authors create asynchronous variants of one-step Q-learning, one-step Sarsa, n-step Q-learning, and advantage actor-critic. Since the asynchronous advantage actor-critic (A3C) algorithm appears to dominate all the others, I’ll just concentrate on that one."

Each actor follows an on-policy approach; the key is how the updates flow from the actors to the global policy being learned and back.

IbrahimSobh commented 7 years ago

Thank you very much fferreres

I just want to make sure that my understanding is correct. According to the A3C pseudo-code:

There are two sets of parameters for the policy and the value function: 1) Global shared parameters 2) Thread-specific parameters

Loop: the agent syncs its parameters from the global shared parameters

For t_max steps:

  • actions are selected according to the policy, using the thread-specific parameters
  • values are computed using the thread-specific parameters

And then accumulate gradients, over the t_max steps, w.r.t. the agent's local parameters.

Then the global parameters are updated asynchronously using the accumulated gradients.
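
For concreteness, here is a minimal single-worker, no-threading sketch of that loop with tabular parameters. It is only an illustration under assumptions: Gym's CliffWalking-v0 (the repo's lib.envs.cliff_walking.CliffWalkingEnv would work the same way), the old 4-tuple step API, and made-up hyperparameters; it is not the repo's A3C code.

```python
import gym
import numpy as np

env = gym.make("CliffWalking-v0")             # stand-in for the repo's CliffWalkingEnv
nS, nA = env.observation_space.n, env.action_space.n

theta = np.zeros((nS, nA))                    # global (shared) policy parameters
theta_v = np.zeros(nS)                        # global (shared) value parameters
alpha, gamma, t_max = 0.01, 0.99, 5           # made-up hyperparameters


def policy(s, params):
    prefs = params[s] - params[s].max()       # softmax over action preferences
    e = np.exp(prefs)
    return e / e.sum()


s = env.reset()                               # old Gym API: reset() returns the state
for update in range(10000):
    # 1) Sync thread-specific parameters from the shared ones
    theta_local, theta_v_local = theta.copy(), theta_v.copy()
    d_theta, d_theta_v = np.zeros_like(theta), np.zeros_like(theta_v)

    # 2) Act for up to t_max steps using the local parameters
    transitions, done = [], False
    for _ in range(t_max):
        a = np.random.choice(nA, p=policy(s, theta_local))
        s_next, r, done, _ = env.step(a)      # old Gym API: 4-tuple
        transitions.append((s, a, r))
        s = s_next
        if done:
            break

    # 3) n-step returns; accumulate gradients w.r.t. the local parameters
    R = 0.0 if done else theta_v_local[s]     # bootstrap from the value estimate
    for s_i, a_i, r_i in reversed(transitions):
        R = r_i + gamma * R
        advantage = R - theta_v_local[s_i]
        grad_log_pi = -policy(s_i, theta_local)
        grad_log_pi[a_i] += 1.0               # grad of log softmax w.r.t. preferences
        d_theta[s_i] += advantage * grad_log_pi
        d_theta_v[s_i] += advantage           # moves V(s_i) toward the n-step return

    # 4) Apply the accumulated gradients to the *shared* parameters
    theta += alpha * d_theta
    theta_v += alpha * d_theta_v

    if done:
        s = env.reset()
```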

Now my questions/comments are:

Just for better understanding and a simpler implementation, I was thinking of implementing the A3C algorithm using the "CliffWalk Actor Critic Solution.ipynb" code. Can we do this in a round-robin fashion (no threading)? I understand it will be slower, but is it a correct implementation?

I understand that "CliffWalk Actor Critic Solution.ipynb" implements the backward view (not the forward view like the A3C paper https://arxiv.org/pdf/1602.01783v2.pdf)

Finally, how can the “CliffWalk Actor Critic Solution.ipynb” code be changed to implement Actor-Critic for Gym Atari games?

Regards

fferreres commented 7 years ago

@IbrahimSobh I am really a novice. I think you have it right now, but for details on the async updates you need to really go into the actual paper (linked by Denny in the root Readme.md, third paper from the bottom, Feb 2016; it has been updated to v2 now).

The t_max steps that an agent performs are actually correlated; is this harmful? Why?

While there is correlation, the learner doesn't get trapped in a "bad ascent" because of the influence of the other learners. Also, actor-critic methods separate the policy and the value function, and (ignorantly) I think correlation matters more in Q-Learning and REINFORCE.

How exactly are the global parameters updated using the accumulated gradients? For example: theta_global = theta_global - alpha * accumulated_gradients?

Read the third paper from the bottom (Feb 2016) linked by Denny in the repo's root Readme.md. It answers this.

Do you think A3C is faster than the others because of the threading implementation or because of Actor-Critic itself?

It's not faster because of the async updates; the paper (which you will really like) compares four algorithms, one of which is Actor-Critic (with a baseline, I guess, since they mention Advantage). A3C is overall more data efficient and learns better functions for many games (see the table in the paper).

IbrahimSobh commented 7 years ago

Thank you fferreres

Actually, I read the paper, but some points are not clear. Anyway, I will wait for comments or more clarifications from Denny.

Regards

dennybritz commented 7 years ago

I added an implementation of A3C: https://github.com/dennybritz/reinforcement-learning/tree/master/PolicyGradient/a3c

The t_max steps that an agent performs are actually correlated; is this harmful? Why?

Probably, but there's a tradeoff between computational efficiency and correlation. With t_max=1 you would be performing an update at each step, i.e. not using minibatches. Because large matrix multiplications are cheap you're probably better off performing batch updates, even if they're correlated.

How exactly are the global parameters updated using the accumulated gradients? For example: theta_global = theta_global - alpha * accumulated_gradients?

Yes, that's how they're updated.
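
To make that update concrete, here is a toy TF1-style snippet (names and the tiny loss are made up; this is not the repo's implementation) showing how gradients computed on a worker's local parameter copy can be applied directly to the shared/global variables:

```python
# Toy TF1 illustration: gradients from the worker's local copy of the
# parameters are applied to the shared/global variables, i.e.
# theta_global <- theta_global - alpha * accumulated_gradients.
import tensorflow as tf

global_w = tf.Variable(tf.zeros([4]), name="global_w")   # shared parameters
local_w = tf.Variable(tf.zeros([4]), name="local_w")     # thread-specific copy

sync_op = local_w.assign(global_w)                       # worker pulls the shared params

target = tf.constant([1.0, 2.0, 3.0, 4.0])
local_loss = tf.reduce_sum(tf.square(local_w - target))  # stand-in for the A3C loss

# Plain SGD matches the formula above; the A3C paper actually uses a shared RMSProp.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads = tf.gradients(local_loss, [local_w])              # grads w.r.t. the local copy...
train_op = optimizer.apply_gradients(zip(grads, [global_w]))  # ...applied to global_w

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(sync_op)
    sess.run(train_op)   # global_w moves; local_w stays put until the next sync
```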

Do you think A3C is faster than the others because of the threading implementation or because of Actor-Critic itself?

Not sure, but the other methods compared in the paper are not policy gradient methods, they are value-based methods. So that's probably the main reason.

I was thinking of implementing the A3C algorithm using the "CliffWalk Actor Critic Solution.ipynb" code. Can we do this in a round-robin fashion (no threading)? I understand it will be slower, but is it a correct implementation?

I think that should be correct.

Since the implementation is here, I'm closing this for now. Feel free to re-open.