Open ShaoyuanLi opened 6 years ago
A2C is an on-policy method. Old data is effectively from another policy, so it isn't a good idea to update the policy network on old samples. I'm not quite sure about the value estimator, though. You might get away with using a replay buffer to train the value network only.
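A minimal sketch of that idea, assuming a PyTorch setup; `value_net`, `value_optim`, and `replay_buffer` are hypothetical names, not anything from this repo:

```python
import torch.nn.functional as F

# Hypothetical: value_net maps a batch of states to scalar values, and
# replay_buffer.sample() returns (states, returns) tensors collected
# under past policies.
def update_value_from_replay(value_net, value_optim, replay_buffer, batch_size=64):
    states, returns = replay_buffer.sample(batch_size)
    values = value_net(states).squeeze(-1)
    # Regressing toward returns collected under slightly older policies
    # only approximates the current policy's value function, but it may
    # work in practice, as suggested above.
    value_loss = F.mse_loss(values, returns)
    value_optim.zero_grad()
    value_loss.backward()
    value_optim.step()
    # The policy (actor) update stays strictly on-policy and never
    # touches the replay buffer.
```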
csxeba is right: A2C and A3C are on-policy methods. Old data is sampled by an old policy, so it is clearly not from the same distribution as data from the current one. We usually use a buffer only to store the data sampled by the current policy, and we need to clear it after each update. A sketch of that pattern follows below.
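A minimal sketch of that on-policy collect-update-clear loop; all names here are hypothetical, not this repo's API:

```python
# Hypothetical on-policy rollout storage for A2C: collect transitions
# under the *current* policy, run one update, then discard everything.
class RolloutBuffer:
    def __init__(self):
        self.states, self.actions, self.rewards = [], [], []

    def add(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def clear(self):
        # Clearing after each update keeps every training batch on-policy.
        self.states.clear()
        self.actions.clear()
        self.rewards.clear()

# Training loop sketch:
# for each iteration:
#     collect n steps with the current policy into the buffer
#     compute advantages, update actor and critic once
#     buffer.clear()   # mandatory for on-policy methods like A2C/A3C
```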
I read your code and implemented a version with experience replay. However, I find that the losses explode after a few frames (around 1000): the value loss becomes very large, and the action loss becomes very large in the negative direction. Is this a bug in my code, or does A2C not support experience replay in theory?