Kropekk opened this issue 6 years ago
squared_difference is recommended in Silver's slides, as I remember.
My question here is actually about whether we could use td_error as the target; although td_target looks more reasonable, it differs from Sutton's pseudocode.
Any comment on this?
Huber_loss is a squared_difference with gradient clipping applied. I think there is no contradiction here.
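To illustrate that equivalence, here is a minimal NumPy sketch (delta=1.0 and the helper names are just for illustration, not the repo's code): the gradient of the Huber loss is exactly the error term clipped to [-delta, delta].

import numpy as np

def huber_loss(td_error, delta=1.0):
    # quadratic inside [-delta, delta], linear outside
    return np.where(np.abs(td_error) <= delta,
                    0.5 * np.square(td_error),
                    delta * (np.abs(td_error) - 0.5 * delta))

def huber_gradient(td_error, delta=1.0):
    # derivative of the above: the squared-error gradient clipped to [-delta, delta]
    return np.clip(td_error, -delta, delta)

print(huber_gradient(np.array([-3.0, 0.4, 2.5])))  # roughly [-1.  0.4  1.]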
I'm not sure what you mean in the td_error vs td_target part. We try to minimize td_error while estimating td_target. I've looked Sutton's book up and I don't see any mistakes there.
If you were more specific and provided quotes/screenshots it would be easier to solve your puzzle :)
OK, I'll read that Karpathy paper to understand. Regarding td_error, I was actually talking about the "CliffWalk Actor Critic Solution":
# Calculate TD Target
value_next = estimator_value.predict(next_state)
td_target = reward + discount_factor * value_next
td_error = td_target - estimator_value.predict(state)
# Update the value estimator
estimator_value.update(state, td_target)
# Update the policy estimator
# using the td error as our advantage estimate
estimator_policy.update(state, td_error, action)
In the policy update it's using td_error, and in the value update it's using td_target. I understand it's correct and natural to use td_target for the value update. But in Sutton's book, "Actor-Critic Methods" -> "One-step Actor-Critic (episodic)", he uses td_error for both updates. Shouldn't this translate to using td_error in the value update too? I tried it and it doesn't work, but it's hard to understand why Sutton wrote it that way.
δ ← R + γ v̂(S′, w) − v̂(S, w)    (if S′ is terminal, then v̂(S′, w) ≐ 0)
w ← w + α^w I δ ∇_w v̂(S, w)
θ ← θ + α^θ I δ ∇_θ ln π(A | S, θ)
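To make my confusion concrete, here is a toy linear-value sketch (the names and numbers are made up): if I'm not mistaken, a semi-gradient step on the squared error toward td_target gives exactly the same step as Sutton's w ← w + α δ ∇v̂(S, w) update.

import numpy as np

# Hypothetical linear value estimator v(s; w) = w . phi(s)
phi_s, phi_next = np.array([1.0, 0.5]), np.array([0.2, 1.0])
w = np.array([0.1, -0.3])
reward, gamma, alpha = 1.0, 0.99, 0.1

v_s, v_next = w @ phi_s, w @ phi_next
td_target = reward + gamma * v_next
td_error = td_target - v_s

# (a) Sutton's value update: w <- w + alpha * delta * grad_w v(S; w)
w_sutton = w + alpha * td_error * phi_s

# (b) Semi-gradient step on 0.5 * (td_target - v(S; w))**2, treating td_target as a constant
grad = -(td_target - v_s) * phi_s
w_regression = w - alpha * grad

assert np.allclose(w_sutton, w_regression)  # the two updates coincide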
Thanks.
It's DeepMind's paper - "Human-Level Control through Deep Reinforcement Learning" - co-authored with David Silver. PDF can be found here: https://github.com/dennybritz/reinforcement-learning/tree/master/DQN
This Issue is about DQN implementation and unfortunately I cannot help you with Actor Critic Solution right now.
I meant the Karpathy comment, not a paper. No problem. Please do comment once you go through the Actor Critic solution though :)
First of all - thank you very much for this repository! You have made diving into Reinforcement Learning easier!
About the issue: I think you should use huber_loss instead of squared_difference. Look for "clipping the squared error" in the paper. (Also, this comment from Andrej Karpathy may be useful: https://github.com/devsisters/DQN-tensorflow/issues/16.)
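To make the suggestion concrete, here is a minimal TF 1.x-style sketch of the change I have in mind (the placeholder tensors and names are mine, not taken from your code):

import tensorflow as tf  # TF 1.x API, which I assume the repo targets

# Illustrative placeholders, not your actual variable names
td_targets = tf.placeholder(tf.float32, [None], name="td_targets")    # r + gamma * max_a' Q(s', a')
q_predicted = tf.placeholder(tf.float32, [None], name="q_predicted")  # Q(s, a) for the chosen actions

# Current form: unclipped squared error
squared_loss = tf.reduce_mean(tf.squared_difference(td_targets, q_predicted))

# Suggested form: Huber loss, i.e. the squared error with its gradient clipped to [-1, 1]
clipped_loss = tf.reduce_mean(
    tf.losses.huber_loss(labels=td_targets, predictions=q_predicted, delta=1.0))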
Also, I have a question about the RMSProp parameters. The code says the parameters passed to the optimizer are "from original paper". However, I cannot find these values in the "Human-Level Control through Deep Reinforcement Learning" paper. For example, you use decay=0.99, which doesn't seem to be mentioned in the paper at all (I believe the discount factor refers to γ in the target = r + γQ(s', a) equation, not to the decay in RMSProp. Or maybe it refers to both?). Also, shouldn't momentum be equal to 0.95? And where did your epsilon=1e-6 come from? Could you please point out where these values come from? I would really appreciate it!
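For reference, this is roughly how I read the two sets of values, if I am reading the paper's hyperparameter table right; the mapping of the paper's names onto TF's RMSPropOptimizer arguments is my own guess (DeepMind's RMSProp variant is not exactly TF's), and the learning_rate/momentum in the first call are placeholders, not quotes from your code:

import tensorflow as tf  # TF 1.x API, an assumption on my part

# Values I see being questioned in the code
repo_optimizer = tf.train.RMSPropOptimizer(
    learning_rate=0.00025,  # placeholder, not the point of the question
    decay=0.99,             # the value described as "from original paper"
    momentum=0.0,
    epsilon=1e-6)

# What I would expect if the values came straight from the paper's hyperparameter table,
# with my guessed mapping: decay ~ "squared gradient momentum", momentum ~ "gradient momentum",
# epsilon ~ "min squared gradient"
paper_optimizer = tf.train.RMSPropOptimizer(
    learning_rate=0.00025,
    decay=0.95,
    momentum=0.95,
    epsilon=0.01)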