We also encountered this problem. The total loss, entropy loss, value loss, and policy gradient loss all increase during training as the reward increases. We have tried all the layouts, and the loss only decreases when the reward stays at 0, which is what we would normally expect. We checked our calls to SB3 and the loss computation but found no obvious errors, and no errors or warnings appeared during training. Could there be something wrong with SB3 itself, or with how the log is printed?
I believe that this behavior is expected and isn't an issue. SB3 (and every other PPO implementation I'm aware of) uses a mean-squared-error loss between the value function's predicted returns and the "true" returns computed from the rollout. When the policy is stochastic and its expected reward is increasing, the returns naturally become more variable, so the MSE grows even though the value function may be getting more accurate.
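Here is a minimal sketch (not SB3's actual code, and with made-up return statistics rather than Overcooked numbers) of why this happens: with the same MSE form of value loss, a critic that predicts the mean return perfectly still reports a loss equal to the variance of the returns, which grows as the policy earns more reward.

```python
# Illustrative only: MSE value loss grows with return variance even when
# the critic is "accurate" (predicts the true mean return exactly).
import numpy as np

rng = np.random.default_rng(0)

def value_loss(predicted_values, rollout_returns):
    # Same form as an (unclipped) MSE value loss: mean squared error
    # between critic predictions and empirical rollout returns.
    return np.mean((rollout_returns - predicted_values) ** 2)

# Hypothetical (mean return, return std) pairs as training progresses.
for mean_return, return_std in [(0.0, 1.0), (100.0, 30.0), (300.0, 80.0)]:
    returns = rng.normal(mean_return, return_std, size=2048)
    predictions = np.full_like(returns, mean_return)  # perfectly calibrated critic
    print(f"mean return {mean_return:6.1f}  value loss {value_loss(predictions, returns):10.1f}")
```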
In my experience, the value loss and policy loss reported by PPO (and most other RL algorithms) are not a strong signal of whether learning is happening. I've seen the same behavior in single-agent environments like CartPole, and also with other implementations of PPO (such as CleanRL or Garage).
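If you want a clearer signal of progress than the loss curves, a common approach is to track periodic evaluation reward. A minimal sketch, assuming plain Stable-Baselines3 with CartPole-v1 rather than the Overcooked setup from the original command (newer SB3 versions use gymnasium instead of gym):

```python
# Track learning via periodic evaluation reward instead of the loss curves.
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

train_env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

# Evaluate the current policy every 10k steps over 5 episodes; the reported
# eval mean reward is a more direct measure of learning than value/policy loss.
eval_callback = EvalCallback(eval_env, eval_freq=10_000, n_eval_episodes=5,
                             deterministic=True)

model = PPO("MlpPolicy", train_env, verbose=1, seed=10)
model.learn(total_timesteps=500_000, callback=eval_callback)
```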
Hello, I ran this framework with the following command to train PPO:
python3 trainer.py OvercookedMultiEnv-v0 PPO PPO --env-config '{"layout_name": "simple"}' --seed 10 --preset 1
However, in the training log I found that the value function loss increases as the reward increases, which confuses me. The ep_rew_mean can reach 300 when total-timesteps is 500000. I wonder how to solve this, because it looks like a bug.