Closed jbakams closed 3 years ago
Hi, thanks for writing. Ratio b/w old and new prob is 1 only for the first epoch. And Agent is trained for 10 epoch for each set of experience. Agent will show no learning if ratio is always 1. I Hope that helped. Thanks.
@abhisheksuran Yeah, I have seen what I missed!! Thanks.
Hi @abhisheksuran, thank you for the amazing implementation. Seems like the ratio between the old probability and new probability will always be 1 as they are always the same in the PPO code. I used the following approach to be sure they are likely to be different ( see lines commented with many #######). I may be wrong,. If so! Please enlighten me.
And then in the function learn:
Then to be sure to have a set of of old_probability for the first run