Khev opened this issue 5 years ago
On line 204 we call the function self.transform_reward(), which transforms the contents of the reward array into the discounted returns; hope that clarifies.
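For anyone reading along, here is a minimal sketch of what such a transform_reward might do. This is not the repo's exact code; the discount factor GAMMA and the flat per-episode list self.reward are assumptions for illustration.

```python
GAMMA = 0.99  # assumed discount factor

def transform_reward(self):
    # Walk backwards through the episode so each entry becomes the
    # discounted return: r_t + GAMMA * r_{t+1} + GAMMA^2 * r_{t+2} + ...
    running_return = 0.0
    for t in reversed(range(len(self.reward))):
        running_return = self.reward[t] + GAMMA * running_return
        self.reward[t] = running_return
```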
Ah ya, that makes sense. Thanks!
Also, I noticed you didn't use target networks for the critic. Did you observe any instability in the learning as a result? Just curious!
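In case it helps the discussion, a common way to add one is a Polyak (soft) update of a separate target critic. The sketch below is purely illustrative and not from this repo; the names self.critic / target_critic and the rate TAU are hypothetical, and it assumes Keras models so get_weights()/set_weights() are available.

```python
TAU = 0.005  # soft-update rate (assumed)

def soft_update_target(critic, target_critic, tau=TAU):
    # new_target = tau * online + (1 - tau) * old_target, weight by weight
    online_w = critic.get_weights()
    target_w = target_critic.get_weights()
    target_critic.set_weights(
        [tau * w + (1.0 - tau) * tw for w, tw in zip(online_w, target_w)]
    )
```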
Hi there, thanks for sharing your code. I think there's an error on line 280 of main.py:
_critic_loss = self.critic.fit([obs], [reward], batch_size=BATCHSIZE, shuffle=True, epochs=EPOCHS, verbose=False)
Shouldn't the critic be fitting to the discounted_returns instead of the raw rewards? That is, the line should read:
_critic_loss = self.critic.fit([obs], [discounted_returns], batch_size=BATCHSIZE, shuffle=True, epochs=EPOCHS, verbose=False)