Closed random-user-x closed 6 years ago
I believe this is the correct implementation. The variance has decreased, as discussed in the paper. I haven't tuned the hyperparameters yet; a single arbitrary run gives this result (variance is evaluated every 5000 steps).
I haven't used the --lr-decay option. I think enabling it will make learning smoother.
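For readers unfamiliar with the idea, here is a minimal sketch of what a learning-rate decay option might do. It assumes a linear anneal from the initial rate down to zero over training; the thread does not say how --lr-decay is actually implemented, and the function name and parameters below are hypothetical.

```python
def linearly_decayed_lr(initial_lr, step, total_steps):
    """Return a learning rate annealed linearly from initial_lr to 0.

    Assumption: linear schedule; the real --lr-decay behaviour may differ.
    """
    # Fraction of training still remaining, clamped so it never goes negative.
    remaining_fraction = max(total_steps - step, 0) / total_steps
    return initial_lr * remaining_fraction


if __name__ == "__main__":
    # Print the rate at a few points of a hypothetical 100-step run.
    for step in (0, 25, 50, 100):
        print(step, linearly_decayed_lr(0.001, step, 100))
```

Smaller steps late in training tend to damp oscillations in the policy, which is one reason a decaying schedule could make the learning curve smoother.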
Great, thank you very much.
Paper implementation of trust region updates.
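As an illustration only, the sketch below shows one common form of trust region update, the closed-form gradient adjustment used in ACER-style actor-critic methods. Whether this is the exact update from the paper referenced here is an assumption, since the thread does not name the paper. In that formulation, g is the gradient of the policy objective, k is the gradient of a KL divergence between an average policy and the current policy, and delta is the trust region bound; all names below are hypothetical.

```python
def trust_region_adjust(g, k, delta):
    """Project g so that its component along k does not exceed delta.

    Computes z = g - max(0, (k.g - delta) / ||k||^2) * k.
    Assumption: ACER-style update; g and k are plain lists of floats.
    """
    k_dot_g = sum(ki * gi for ki, gi in zip(k, g))
    k_norm_sq = sum(ki * ki for ki in k)
    if k_norm_sq == 0.0:
        # No KL gradient, so the constraint is inactive.
        return list(g)
    # Scale is zero whenever the constraint k.g <= delta already holds.
    scale = max(0.0, (k_dot_g - delta) / k_norm_sq)
    return [gi - scale * ki for gi, ki in zip(g, k)]


if __name__ == "__main__":
    # Constraint active: k.g = 2 exceeds delta = 1, so g is pulled back along k.
    print(trust_region_adjust([2.0, 0.0], [1.0, 0.0], 1.0))
    # Constraint inactive: g is returned unchanged.
    print(trust_region_adjust([2.0, 0.0], [1.0, 0.0], 3.0))
```

Because the adjustment has a closed form, it avoids the conjugate-gradient solve that other trust region methods need, at the cost of constraining only a linearized KL term.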