Open ShaoyuanLi opened 6 years ago
A2C is an on-policy method. Old data is effectively from another policy, so it isn't a good idea to update the policy network on old samples. I'm not quite sure about the value estimator, though. You might get away with using a replay buffer to train the value network only.
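A minimal sketch of that idea, assuming a PyTorch setup; `value_net`, `value_optim`, and `replay_buffer` are hypothetical names, not anything from this repo:

```python
import torch.nn.functional as F

# Hypothetical: value_net maps a batch of states to scalar values, and
# replay_buffer.sample() returns (states, returns) tensors collected
# under past policies.
def update_value_from_replay(value_net, value_optim, replay_buffer, batch_size=64):
    states, returns = replay_buffer.sample(batch_size)
    values = value_net(states).squeeze(-1)
    # Regressing toward returns collected under slightly older policies
    # only approximates the current policy's value function, but it may
    # work in practice, as suggested above.
    value_loss = F.mse_loss(values, returns)
    value_optim.zero_grad()
    value_loss.backward()
    value_optim.step()
    # The policy (actor) update stays strictly on-policy and never
    # touches the replay buffer.
```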
csxeba is right: A2C and A3C are on-policy methods. Old data is sampled by an old policy, so it is clearly not from the same distribution as data from the current one. We usually use a buffer only to store the data sampled by the current policy, and we need to clear it after each update. A sketch of that pattern follows below.
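A minimal sketch of that on-policy collect-update-clear loop; all names here are hypothetical, not this repo's API:

```python
# Hypothetical on-policy rollout storage for A2C: collect transitions
# under the *current* policy, run one update, then discard everything.
class RolloutBuffer:
    def __init__(self):
        self.states, self.actions, self.rewards = [], [], []

    def add(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def clear(self):
        # Clearing after each update keeps every training batch on-policy.
        self.states.clear()
        self.actions.clear()
        self.rewards.clear()

# Training loop sketch:
# for each iteration:
#     collect n steps with the current policy into the buffer
#     compute advantages, update actor and critic once
#     buffer.clear()   # mandatory for on-policy methods like A2C/A3C
```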
I read your code and implemented a version with experience replay. However, I find that the losses explode after a few frames (around 1000): the value loss becomes very large, and the action loss becomes very large in the negative direction. Is this a bug in my code, or does A2C not support experience replay in theory?