danijar / dreamerv2

Mastering Atari with Discrete World Models
https://danijar.com/dreamerv2
MIT License

Have you considered using a PPO actor instead of a normal Actor-Critic? #2

Closed · outdoteth closed this issue 3 years ago

outdoteth commented 3 years ago

I think a substantial improvement could be gained by using a PPO actor.

danijar commented 3 years ago

PPO clips the probability ratio between the updated policy and the policy that collected the data, so that it can safely take multiple gradient steps on the same batch of on-policy data. DreamerV2 uses a world model and can therefore generate an unlimited amount of on-policy data without having to interact with the environment, so there is not much point in training on the same imagined trajectories multiple times.
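To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not code from this repository; names like `ppo_clip_loss` and `clip_eps` are hypothetical) contrasting the PPO clipped surrogate with the plain policy-gradient objective that suffices when every batch is freshly generated by the current policy:

```python
import numpy as np

def ppo_clip_loss(logp, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate: clips the probability ratio
    pi_new(a|s) / pi_old(a|s) so that several gradient steps
    can safely be taken on the same (increasingly stale) batch."""
    ratio = np.exp(logp - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

def plain_pg_loss(logp, advantage):
    """Plain on-policy policy gradient (REINFORCE with a baseline).
    Safe when, as in DreamerV2's imagination training, every batch of
    trajectories is generated fresh from the current policy, so the
    ratio above would always equal 1."""
    return -np.mean(logp * advantage)
```

At the first gradient step after collecting a batch, `logp == logp_old`, the ratio is 1, and the two objectives yield the same gradient; the clipping only starts to matter on later passes over the same stale batch, which DreamerV2 sidesteps by imagining a fresh batch for every update.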