DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] PPO rollout with numsteps > episode length #528

Closed rhelpacc closed 3 years ago

rhelpacc commented 3 years ago

What does it mean when we roll out PPO with `n_steps` greater than the episode length?

I know from the code that the environment is reset and rollout collection continues past the terminal timestep. My question is more fundamental: I don't know how this affects the trained policy. I can see the upside that there are more samples to estimate the advantage function. But at the same time, I am not sure whether it results in a rather static policy that tries to do well over the long run. If the system dynamics are not stationary, we might end up with a static policy that does its best to balance the different regimes of the dynamics.

I'd appreciate it if anyone could shed some light on this. Thank you.
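To make the setting concrete, here is a minimal sketch (not from the original issue) where `n_steps` exceeds the maximum episode length: CartPole-v1 episodes are capped at 500 steps, so every 2048-step rollout spans several episodes.

```python
from stable_baselines3 import PPO

# n_steps=2048 transitions are collected per rollout, while a CartPole-v1
# episode lasts at most 500 steps, so each rollout contains data from
# several episodes (the env is reset and collection simply continues).
model = PPO("MlpPolicy", "CartPole-v1", n_steps=2048, verbose=1)
model.learn(total_timesteps=10_000)
```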

Additional context

N/A


Miffyli commented 3 years ago

Hmm, I am not quite sure I follow. If a rollout contains samples from multiple episodes, the advantage estimation will not "leak" across episodes. This actually sounds ideal, as you get to average your updates over multiple unique episodes instead of just sub-episode trajectories. I assume this is what you were saying in the first part.

I do not see how this would lead to negative effects compared to not sampling multiple episodes. The problem of static vs. dynamic environments (where transition dynamics change over time) applies to both cases. Having samples from multiple episodes per update could even capture some of these dynamics and help learning. Please correct me if I misunderstood you.
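To illustrate why nothing leaks across episodes, here is a simplified sketch of GAE with episode-boundary masking, loosely following what SB3's `RolloutBuffer.compute_returns_and_advantage` does (the standalone function and argument names here are illustrative, not the library API):

```python
import numpy as np

def compute_gae(rewards, values, episode_starts, last_value, last_done,
                gamma=0.99, gae_lambda=0.95):
    """GAE over a rollout that may contain several episodes.

    episode_starts[t] is 1.0 if step t began a new episode. The
    (1 - episode_start) factor zeroes both the bootstrap term and the
    carried-over GAE at episode boundaries, so advantages from one
    episode never influence another.
    """
    n_steps = len(rewards)
    advantages = np.zeros(n_steps, dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(n_steps)):
        if t == n_steps - 1:
            # Bootstrap from the value of the state after the rollout,
            # unless the rollout happened to end exactly at a terminal step.
            next_non_terminal = 1.0 - float(last_done)
            next_value = last_value
        else:
            next_non_terminal = 1.0 - episode_starts[t + 1]
            next_value = values[t + 1]
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    return advantages
```

Within a rollout, that masking factor does all the work: each episode's advantages are computed exactly as they would be if the episode were its own rollout.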

If you wish to discuss these more fundamental and theoretical things, I recommend joining the RL Discord.

rhelpacc commented 3 years ago

Thank you, Miffyli. PPO is new to me. I agree with what you're saying. And thanks for pointing me to the RL Discord.