PacktPublishing / Deep-Reinforcement-Learning-Hands-On

Hands-on Deep Reinforcement Learning, published by Packt

Chapter 14 DDPG code about policy gradient #59

Open GuilinF opened 4 years ago

GuilinF commented 4 years ago

Sir, thanks for your book and code, it is very nice. In the Chapter 14 DDPG code I can't understand the actor policy gradient update. This is the relevant part of your code:

```python
act_opt.zero_grad()
cur_actions_v = act_net(states_v)
actor_loss_v = -crt_net(states_v, cur_actions_v)
actor_loss_v = actor_loss_v.mean()
actor_loss_v.backward()
act_opt.step()
tb_tracker.track("loss_actor", actor_loss_v, frame_idx)
```

However, I can't match this to the formula in the DDPG paper, which writes the actor update as the product of two gradients: the gradient of Q with respect to the action, multiplied by the gradient of the policy with respect to its parameters.
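For reference, the minibatch update from the DDPG paper, as I transcribe it, is the sampled deterministic policy gradient written as that product of two gradients:

```latex
% Minibatch deterministic policy gradient from the DDPG paper (my transcription)
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i}
    \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)}
    \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_i}
```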

Thank you for your patience

josiahls commented 1 year ago

~I honestly feel like the code from the book is wrong. It does train, but I think it is incorrect and slower than it should be.~

~I say this because the above loss will grow to 10, 20, 30+, which isn't great. The main reason is that the loss from the book is the literal Q value, not the gradient/derivative. I don't think you want to be working with literal Q values as losses, since they can be much larger or smaller than 1.~

~If you compute the gradient instead, it will likely stay a reasonable number regardless of whether Q is 1 or 1000.~

~Once again, this code does work, but I think it would train faster and more stably using the gradients instead.~

Edit: Thinking about this more, the reason this doesn't match the equation is that you're using PyTorch's autograd to do the gradients for you: calling backward() on -Q(s, mu(s)) applies the chain rule, which is exactly the product of the two gradients in the paper. It's not a particularly satisfying answer, and I still find the idea of actor_loss_v steadily growing larger concerning, since it might be slowing training.
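Here is a minimal sketch (not the book's code; the tiny actor/critic networks, sizes, and names below are made up for the illustration) checking that autograd through -Q(s, mu(s)) produces the same actor gradients as the paper's explicit product of dQ/da and dmu/dtheta:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 3, 2, 8
actor = nn.Linear(obs_dim, act_dim)        # stands in for mu(s | theta_mu)
critic = nn.Linear(obs_dim + act_dim, 1)   # stands in for Q(s, a | theta_Q)

states = torch.randn(batch, obs_dim)

# 1) What the book does: let autograd chain through the critic.
actions = actor(states)
loss = -critic(torch.cat([states, actions], dim=1)).mean()
auto_grads = torch.autograd.grad(loss, list(actor.parameters()))

# 2) The paper's form: take dQ/da at a = mu(s), then push it through mu by hand.
actions = actor(states)                                   # fresh forward pass
q_mean = critic(torch.cat([states, actions], dim=1)).mean()
dq_da = torch.autograd.grad(q_mean, actions, retain_graph=True)[0]
manual_grads = torch.autograd.grad(actions, list(actor.parameters()),
                                   grad_outputs=-dq_da)   # chain rule by hand

for g_auto, g_manual in zip(auto_grads, manual_grads):
    assert torch.allclose(g_auto, g_manual, atol=1e-6)
print("backward() on -Q(s, mu(s)) matches the explicit gradient product")
```

The two sets of gradients agree up to floating-point tolerance, which is why the book's one-line loss and the paper's two-gradient formula describe the same update.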