chingyaoc / pytorch-REINFORCE

PyTorch Implementation of REINFORCE for both discrete & continuous control

Please add some explanation #2

Open parajain opened 6 years ago

parajain commented 6 years ago

Hi,

Thank you for the sample code. I could not understand what exactly is happening here: https://github.com/JamesChuanggg/pytorch-REINFORCE/blob/master/reinforce_discrete.py#L52

If possible, can you please give a little explanation?

Thanks

hortune commented 6 years ago

It's just maximizing the objective function.

zafarali commented 6 years ago

This is where the loss is being calculated. If you look at the algorithm presented in Sutton's book (page 289), it is slightly different from what is given here, which is closer to Deep RL - Policy Gradients (page 34).

Basically, instead of applying an update step after computing each advantage * grad log pi term, we compute all the terms, then sum them into a single loss so that we can call backward() on that once (see the sketch below). I am not sure what the theoretical differences are between applying t updates per episode versus one update per episode, but I am currently looking into it.
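
A minimal sketch of that pattern (hypothetical names, not the repository's exact code; it assumes log_probs were saved as tensors during the rollout):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # log_probs: list of scalar tensors, log pi(a_t | s_t), saved during the episode
    # rewards:   list of floats, r_t, for the same episode
    # Discounted returns R_t = r_t + gamma * R_{t+1}, computed backwards in time
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    # Sum every return-weighted log-prob term into one scalar loss; a single
    # backward() / optimizer.step() then applies all t updates at once.
    loss = 0.0
    for log_prob, R_t in zip(log_probs, returns):
        loss = loss - log_prob * R_t
    return loss
```

After each episode you would then do: loss = reinforce_loss(log_probs, rewards); optimizer.zero_grad(); loss.backward(); optimizer.step().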

gmftbyGMFTBY commented 5 years ago

I'm also confused about the entropies in the loss function. Can anyone give a little explanation?
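
For context: the entropy term in the linked line is a common exploration bonus. Subtracting a small multiple of the policy's entropy from the loss rewards policies that stay stochastic, which discourages premature collapse to a deterministic policy. A minimal sketch of the idea, with a hypothetical helper name and an assumed coefficient (not necessarily the repository's value):

```python
import torch
from torch.distributions import Categorical

def step_loss(logits, action, R, entropy_coef=1e-4):
    # One timestep's contribution to the loss (hypothetical helper;
    # entropy_coef = 1e-4 is an assumption, not the repo's exact value).
    dist = Categorical(logits=logits)
    log_prob = dist.log_prob(action)
    entropy = dist.entropy()  # large when the action distribution is spread out
    # Subtracting the entropy rewards uncertain policies, so the agent
    # keeps exploring instead of committing to one action too early.
    return -(log_prob * R) - entropy_coef * entropy
```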