chrisgrimm / soft_actor_critic


code problem #3

Open Joll123 opened 5 years ago

Joll123 commented 5 years ago

First of all, thank you for the code you provided. There are two places I cannot understand. I would be grateful if you could help me.

  1. self.pi_loss = pi_loss = tf.reduce_mean(log_pi_sampled * tf.stop_gradient(log_pi_sampled - Q_sampled + V_S1))
  2. tf.reduce_sum(log_prob, axis=1) - tf.reduce_sum(tf.log(1 - tf.square(tf.tanh(u)) + EPS), axis=1)

These do not seem to match the paper.
chrisgrimm commented 5 years ago

Hi Joll,

Thanks for making use of my repository! Let me see if I can help...

  1. It does appear that the authors have changed the way they approximate the gradients with respect to the policy parameters. The code snippet that you're looking at mirrors what the authors specified in the first version of their paper https://arxiv.org/pdf/1801.01290v1.pdf in Eq (11).

  2. This line is computing the log probability of the "bounded" action. In the paper the authors found superior performance by taking a ~ N(u(s), v(s)) then "bounding the action" by performing a' = tanh(a). To compute the log probability of a' you need to perform a change of variables https://en.wikipedia.org/wiki/Probability_density_function#Function_of_random_variables_and_change_of_variables_in_the_probability_density_function. This is explained in Eq (21) in the appendix of the paper under the section titled "Enforcing Action Bounds."
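In case it helps, here is a minimal NumPy sketch of that change-of-variables computation (this is not the repo code; the EPS value, shapes, and names are just illustrative):

    import numpy as np

    EPS = 1e-6  # small constant for numerical stability (illustrative value)

    def squashed_gaussian_log_prob(u, mu, log_std):
        """Log-density of a' = tanh(u) with u ~ N(mu, exp(log_std)^2),
        using the change-of-variables formula (Eq (21) in the SAC appendix)."""
        std = np.exp(log_std)
        # log N(u | mu, std), summed over action dimensions
        log_prob_u = np.sum(
            -0.5 * ((u - mu) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi),
            axis=1,
        )
        # subtract log|d tanh(u)/du| = log(1 - tanh(u)^2), with EPS for stability
        correction = np.sum(np.log(1 - np.tanh(u) ** 2 + EPS), axis=1)
        return log_prob_u - correction

    # Toy check: a single state with a 2-D action
    rng = np.random.default_rng(0)
    mu, log_std = np.zeros((1, 2)), np.zeros((1, 2))
    u = mu + np.exp(log_std) * rng.standard_normal((1, 2))
    print(squashed_gaussian_log_prob(u, mu, log_std))

As for point 1, the tf.stop_gradient wrapper makes the (log_pi_sampled - Q_sampled + V_S1) term act as a constant weight, so gradients only flow through the leading log_pi_sampled; that is exactly the score-function estimator of Eq (11) in the v1 paper.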

Best Regards, Chris Grimm

Joll123 commented 5 years ago

Thanks for your reply. Can you share your experience debugging your code? I ran it on my own environment and it failed.

chrisgrimm commented 5 years ago

Sorry for the delay, I've been sick with a cold. What environment are you running on?

In general, I usually assess the correctness of these algorithms by running them on very simple domains (like cartpole) to get a rough sense of whether they are working at all. Then, once I'm sure they work there, I move to more complicated domains (like half-cheetah).
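Something like the following loop (classic gym API; the agent.act / agent.observe methods are just a hypothetical interface, not from this repo) is usually enough to see whether returns trend upward on an easy task:

    import gym

    def smoke_test(agent, env_name="CartPole-v1", episodes=50):
        """Rough sanity check: average return should trend upward on an easy task."""
        env = gym.make(env_name)
        returns = []
        for _ in range(episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                action = agent.act(obs)           # hypothetical: agent picks an action
                obs, reward, done, _ = env.step(action)
                agent.observe(obs, reward, done)  # hypothetical: store transition / update
                total += reward
            returns.append(total)
        return returns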


Joll123 commented 5 years ago

Oh, I am so sorry to hear that; I hope you feel better soon. My environment requires a robot to find a target. I ran your code on the cartpole environment and it succeeded. I changed the neural network to fit my environment, but it failed to run. Could the dropout and batch-norm (BN) operators I added be having an effect?