dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

[bug] DQN/dqn.py: Incorrect loss function. [question] Question about RMSProp parameters #174

Open Kropekk opened 6 years ago

Kropekk commented 6 years ago

First of all - thank you very much for this repository! You have made diving into Reinforcement Learning easier!

About the issue: I think you should use huber_loss instead of squared_difference. Look for "clipping the squared error" in the paper. (This comment from Andrej Karpathy may also be useful: https://github.com/devsisters/DQN-tensorflow/issues/16.)
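Just to make the suggestion concrete, something like the sketch below (TF 1.x style; the tensor names are hypothetical stand-ins for whatever DQN/dqn.py actually uses, so treat it as an illustration rather than a patch):

    import tensorflow as tf  # TF 1.x, as used in the repo

    # Hypothetical stand-ins for the tensors in DQN/dqn.py:
    # y_pl holds the TD targets, action_predictions the Q-values of the taken actions.
    y_pl = tf.placeholder(tf.float32, [None], name="y")
    action_predictions = tf.placeholder(tf.float32, [None], name="q_taken")

    # Current behaviour (unclipped squared error):
    # losses = tf.squared_difference(y_pl, action_predictions)

    # Suggested: Huber loss, quadratic for small errors and linear for large ones,
    # which bounds the gradient the same way "error clipping" does in the Nature paper.
    losses = tf.losses.huber_loss(labels=y_pl, predictions=action_predictions,
                                  delta=1.0,
                                  reduction=tf.losses.Reduction.NONE)
    loss = tf.reduce_mean(losses)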

Also, I have a question about the RMSProp parameters. The code says the values passed to the optimizer are "from original paper", but I cannot find them in the "Human-Level Control through Deep Reinforcement Learning" paper. For example, you use decay=0.99, which does not seem to be mentioned in the paper at all (I believe the discount factor there refers to γ in the target = r + γQ(s',a) equation, not to the RMSProp decay. Or maybe it refers to both?). Also, shouldn't momentum be 0.95? And where does epsilon=1e-6 come from? Could you please point out where these values come from? I would really appreciate it!
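For reference, here is a rough side-by-side of what I mean. The decay=0.99 and epsilon=1e-6 values are the ones I quoted above; the learning rate and momentum in the first call are my guesses, and the second call is only a loose mapping of the Nature paper's hyperparameter table onto TF's RMSProp (the paper uses a slightly different RMSProp variant, so it is not an exact translation):

    import tensorflow as tf  # TF 1.x

    # Roughly what the repo does today (learning rate and momentum are my assumption):
    optimizer_repo = tf.train.RMSPropOptimizer(
        learning_rate=0.00025, decay=0.99, momentum=0.0, epsilon=1e-6)

    # Rough mapping of the paper's table (learning rate 0.00025, gradient momentum 0.95,
    # squared gradient momentum 0.95, min squared gradient 0.01):
    optimizer_paper = tf.train.RMSPropOptimizer(
        learning_rate=0.00025, decay=0.95, momentum=0.95, epsilon=0.01)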

QiXuanWang commented 6 years ago

squared_difference is what's recommended in Silver's slides, as I remember.

My puzzle here is actually about whether we could use the td_error as the target. The td_target looks more reasonable, but it differs from Sutton's pseudocode.
Any comment on this?

Kropekk commented 6 years ago

Huber loss is essentially the squared difference with gradient clipping applied, so I think there is no contradiction here.
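To illustrate what I mean (just a quick numpy sketch, not code from the repo): the gradient of the Huber loss with delta = 1 is exactly the squared-error gradient clipped to [-1, 1]:

    import numpy as np

    def huber(e, delta=1.0):
        # Quadratic for |e| <= delta, linear beyond.
        return np.where(np.abs(e) <= delta,
                        0.5 * e ** 2,
                        delta * (np.abs(e) - 0.5 * delta))

    def huber_grad(e, delta=1.0):
        # d/de huber(e) = clip(e, -delta, delta): the gradient of the squared
        # error 0.5 * e**2 (which is just e), clipped to [-delta, delta].
        return np.clip(e, -delta, delta)

    errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(huber_grad(errors))  # [-1.  -0.5  0.   0.5  1. ]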

I'm not sure what you mean by the td_error vs td_target part. We try to minimize the td_error while estimating the td_target. I've looked it up in Sutton's book and I don't see any mistake there.

If you were more specific and provided quotes/screenshots it would be easier to solve your puzzle :)

QiXuanWang commented 6 years ago

OK, I'll read that Karpathy paper to understand. As for td_error, I was actually talking about the "CliffWalk Actor Critic Solution":

            # Calculate TD Target
            value_next = estimator_value.predict(next_state)
            td_target = reward + discount_factor * value_next
            td_error = td_target - estimator_value.predict(state)

            # Update the value estimator
            estimator_value.update(state, td_target)

            # Update the policy estimator
            # using the td error as our advantage estimate
            estimator_policy.update(state, td_error, action)

In the policy update it uses td_error, and in the value update it uses td_target. I understand that using td_target for the value update is correct and natural. But in Sutton's book, "Actor-Critic Methods" -> "One-step Actor-Critic (Episodic)", he uses the td_error (δ) for both updates. Shouldn't that translate to using td_error in the value update too? I tried it (sketched below the pseudocode) and it doesn't work, but it's hard to understand why Sutton wrote it that way.

δ ← R + γ·v̂(S′, w) − v̂(S, w)   (if S′ is terminal, then v̂(S′, w) = 0)
w ← w + α_w · I · δ · ∇_w v̂(S, w)
θ ← θ + α_θ · I · δ · ∇_θ ln π(A|S, θ)
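Concretely, what I tried was something like this (assuming estimator_value.update(state, target) does a semi-gradient step toward target, as in the notebook):

    # The variant I tried, following Sutton's pseudocode literally:
    # feed δ (the TD error) into the value update instead of the TD target.
    # This is the version that did not work for me.
    estimator_value.update(state, td_error)

    # The policy update stays the same, with the TD error as the advantage estimate.
    estimator_policy.update(state, td_error, action)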

Thanks.

Kropekk commented 6 years ago

It's DeepMind's paper, "Human-Level Control through Deep Reinforcement Learning", co-authored with David Silver. The PDF can be found here: https://github.com/dennybritz/reinforcement-learning/tree/master/DQN

This issue is about the DQN implementation, and unfortunately I cannot help you with the Actor Critic solution right now.

QiXuanWang commented 6 years ago

I meant the Karpathy comment. No problem. Please do comment once you go through the Actor Critic solution, though :)