DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io

[Question] Relationship between n_step, episode, and advantage in episodic tasks #1938

Closed · d505 closed this issue 4 months ago

d505 commented 4 months ago

❓ Question

Hello!

I have a question about n_steps and the relationship between episodes and the advantage in episodic tasks. I have an episodic task that ends after the same number of steps every time, and I use PPO. If n_steps is greater than the episode length, I believe the advantage computation will also take the following episode into account. Is that actually the case? If so, I would prefer to set n_steps equal to the episode length so that other episodes are not included.

The relevant piece of code: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/buffers.py#L402

On the other hand, other issues seemed to suggest that including other episodes is a good idea. https://github.com/DLR-RM/stable-baselines3/issues/560
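For concreteness, here is a minimal sketch of the setup I mean. Pendulum-v1 is only a stand-in I chose because it always truncates after exactly 200 steps, so every episode has the same length; the exact numbers are assumptions.

```python
from stable_baselines3 import PPO

# Pendulum-v1 always truncates after exactly 200 steps, so every episode has the same length.
# With the default n_steps=2048, one rollout covers a bit more than 10 full episodes,
# and the rollout buffer handed to the advantage computation therefore mixes episodes.
model = PPO("MlpPolicy", "Pendulum-v1", n_steps=2048)

# The alternative raised above: make each rollout exactly one episode long.
# model = PPO("MlpPolicy", "Pendulum-v1", n_steps=200)
```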


qgallouedec commented 4 months ago

If the episode length is constant and n_steps is greater than the episode length, then yes, the rollout will contain data from several episodes. In fact, as explained in the issue you linked, this allows for more stable updates. Why do you want to reduce n_steps so that this isn't the case?

d505 commented 4 months ago

Thanks for the reply. I thought it could be a problem for the advantage calculation. My understanding is that the advantage is similar to a cumulative reward, and a reward obtained for an action in another episode is not related to an action taken in the current episode, so I assumed the advantage calculation should not mix episodes either. https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/buffers.py#L425-L434

qgallouedec commented 4 months ago

I'm not 100% sure I understand what you mean, but if the question is whether the advantage calculation takes the end of an episode into account, i.e. whether the summation stops at the end of the episode, then yes, it does. Additionally, for the TD(λ) estimator, you can check #375 or "Telescoping in TD(λ)" in David Silver's Lecture 4: https://www.youtube.com/watch?v=PnHCvfgC_ZA
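For reference, here is a simplified sketch of the GAE recursion with episode-boundary masking, loosely following the buffers.py code linked above (names abbreviated, not the exact library code):

```python
import numpy as np

def compute_gae(rewards, values, episode_starts, last_value, last_done, gamma=0.99, gae_lambda=0.95):
    """Simplified GAE sketch: episode_starts[t] == 1 marks the first step of a new episode."""
    n = len(rewards)
    advantages = np.zeros(n, dtype=np.float32)
    last_gae = 0.0
    for step in reversed(range(n)):
        if step == n - 1:
            next_non_terminal = 1.0 - last_done
            next_value = last_value
        else:
            # If step + 1 starts a new episode, the mask is 0: no bootstrapping across
            # the boundary, and the accumulated advantage is reset.
            next_non_terminal = 1.0 - episode_starts[step + 1]
            next_value = values[step + 1]
        delta = rewards[step] + gamma * next_value * next_non_terminal - values[step]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[step] = last_gae
    return advantages
```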

d505 commented 4 months ago

Thank you. Maybe I didn't understand the code well enough. So every time a new episode starts, 1 is set in self.episode_starts[], and the advantage accumulation is reset at that boundary.
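A tiny made-up check of that reading, using the compute_gae sketch above (all numbers are arbitrary): two back-to-back episodes of length 3 inside one rollout.

```python
adv = compute_gae(
    rewards=np.ones(6, dtype=np.float32),
    values=np.zeros(6, dtype=np.float32),
    episode_starts=np.array([1, 0, 0, 1, 0, 0], dtype=np.float32),
    last_value=0.0,
    last_done=1.0,  # the rollout happens to end exactly at an episode boundary
)
# episode_starts[3] == 1, so the mask for step 2 is 0: adv[2] == 1.0, i.e. only the
# reward from episode 1 is accumulated and nothing leaks in from episode 2.
```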