germain-hug / Deep-RL-Keras

Keras Implementation of popular Deep RL Algorithms (A3C, DDQN, DDPG, Dueling DDQN)

Question about A2C #1

Closed Khev closed 5 years ago

Khev commented 6 years ago

Hi there, thanks for sharing your code -- it's been very helpful!

One question: is your implementation of A2C a 'genuine' actor-critic method? My (limited) understanding was that to qualify as an actor-critic method, there needs to be temporal-difference learning: you learn from each (S, A, R, S') transition, as opposed to executing a full episode and then learning. I'm following the logic in Sutton's book, the relevant part of which I'm quoting below.

Anyway -- I'm just curious, and thanks again!


You can find the textbook at http://incompleteideas.net/book/the-book-2nd.html; the quote is from page 331.

"Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation. As we have seen, the bias introduced through bootstrapping and reliance on the state representation is often beneficial because it reduces variance and accelerates learning. REINFORCE with baseline is unbiased and will converge asymptotically to a local minimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems. As we have seen earlier in this book, with temporal-di↵erence methods we can eliminate these inconveniences, and through multi-step methods we can flexibly choose the degree of bootstrapping. In order to gain these advantages in the case of policy gradient methods we use actor–critic methods with a bootstrapping critic."

germain-hug commented 6 years ago

Hi and thank you for your interest,

First, regarding the passage you sent me: A2C is different from REINFORCE in that it learns an action-value function, which helps reduce the variance and guide the actor.

Then, I believe you can indeed use TD learning, but without gathering enough experience, learning is very hard and will not converge. The variant I implemented uses n-step returns, which makes the algorithm “halfway” between REINFORCE and A2C: it preserves a low variance (we use a proper critic as opposed to a direct state-value function) and gathers information over sufficient steps to get better estimates of the gradients.

Basically, you don’t have to wait for the episode to be over to learn, but it will be harder to converge.
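As a rough sketch of the n-step idea (simplified, not the exact code in this repository):

```python
import numpy as np

def n_step_advantages(rewards, state_values, bootstrap_value, gamma=0.99):
    """Discounted returns over an n-step rollout, bootstrapped with the critic's
    estimate of the state after the last step, minus the critic's estimates
    along the rollout (the advantages used to scale the policy gradient)."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value  # use 0.0 if the rollout ended the episode
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(state_values)
```

With the rollout covering the whole episode and a zero bootstrap value this reduces to REINFORCE with a baseline; with a single step it becomes a one-step TD update.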

Cheers!

Khev commented 6 years ago

Ah, I see. That makes sense. I just tested the A2C with learning at every step. As you expected, the learning is slow / unstable.

Re: A2C learning an action-value function (as opposed to a state-value function): in your implementation, isn't the critic learning a value function? I'm looking at line 23 of "critic.py", where the output dimension is 1. If it were an action-value function, wouldn't the output be of size action_dim? Further, on line 62 of A2C.py, you call

state_values = self.critic.predict(np.array(states))

which has no reference to actions. In other words, I thought an action-value function would have the form Q(s,a). Sorry if I'm missing something -- I'm new to the field and haven't got all the jargon right yet.
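For reference, here is roughly what I mean by the two shapes (a sketch with hypothetical layer sizes, not the ones in critic.py):

```python
from keras.models import Model
from keras.layers import Input, Dense

state_dim, action_dim = 4, 2  # hypothetical dimensions

# State-value function V(s): state in, a single scalar out (what I see in critic.py)
s = Input(shape=(state_dim,))
h = Dense(64, activation='relu')(s)
v = Dense(1, activation='linear')(h)
value_net = Model(s, v)

# DQN-style action-value function: state in, one value per action out
s2 = Input(shape=(state_dim,))
h2 = Dense(64, activation='relu')(s2)
q = Dense(action_dim, activation='linear')(h2)
q_net = Model(s2, q)
```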

Thanks again!

germain-hug commented 5 years ago

Sorry about the late response!

isn't the critic learning a value function? I'm looking at line 23 of "critic.py", where the output dimension is 1. If it were an action-value function, wouldn't the output be of size action_dim?

Not exactly: the critic learns from the state-action pair (meaning it takes both the state and the action as input), but it still produces a single value.
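Something along the lines of a DDPG-style critic (just a sketch under that assumption, not the actual code in the repo):

```python
from keras.models import Model
from keras.layers import Input, Dense, Concatenate

state_dim, action_dim = 4, 2  # hypothetical dimensions

# Action-value function Q(s, a): state and action both as input, a single scalar out
state_in = Input(shape=(state_dim,))
action_in = Input(shape=(action_dim,))
x = Concatenate()([state_in, action_in])
x = Dense(64, activation='relu')(x)
q_value = Dense(1, activation='linear')(x)
critic = Model([state_in, action_in], q_value)
```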

Further, on line 62 of A2C.py, you call state_values = self.critic.predict(np.array(states)). In other words, I thought an action-value function would have the form Q(s,a)

Thanks for pointing it out; it looks like it's a mistake on my side. It should indeed take the action into account as well, otherwise it is not a proper advantage actor-critic algorithm. I will try to look into correcting that as soon as I can.

Khev commented 5 years ago

No worries about late response. You are right about the Q(s,a) returning a value -- I don't know what I was thinking :-)

Thanks again for sharing your code. I was a complete novice to RL a couple of weeks ago, and in trying to recreate your codebase I've become reasonably fluent. I'm trying to implement MADDPG right now; I'm psyched to see what it can do...