DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License
8.85k stars 1.68k forks

[Question] [Multiprocessing] RolloutBuffer groups environment transitions on a per-environment basis. #1880

Closed N00bcak closed 6 months ago

N00bcak commented 6 months ago

❓ Question

I would like to clarify something about SB3's source code that differs from my own understanding. (Hence, despite the formatting, this is not a bug report.)

Observation

In the source code, RolloutBuffer (and ReplayBuffer, for that matter) appears to store transitions in n_envs-sized chunks.

The samples are then retrieved with these chunks untouched.

This results in an effective minibatch size of n_envs * batch_size transitions.

Expectation

Unlike the documentation for the n_steps argument, the documentation for batch_size does not mention this behavior.

Therefore, I expected the minibatch size to remain batch_size.

Question

  1. What are the developers' considerations behind grouping the transitions together like this?
  2. How would this differ from, say, shuffling the transitions individually?


N00bcak commented 6 months ago

Apologies, I have made a mistake in interpreting the RolloutBuffer code. The samples are flattened in the get() function, which allows the transitions to be sampled individually.
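
To illustrate the correction, here is a minimal NumPy sketch of the flatten-then-shuffle idea. It does not import SB3; the helper iter_minibatches and the toy shapes are made up for illustration, and swap_and_flatten is written from my reading of the buffer code rather than copied from it, so treat the details as assumptions.

```python
import numpy as np


def swap_and_flatten(arr: np.ndarray) -> np.ndarray:
    """Reshape (n_steps, n_envs, ...) into (n_steps * n_envs, ...).

    Axes 0 and 1 are swapped first, so all steps of env 0 come first,
    then all steps of env 1, and so on.
    """
    shape = arr.shape
    if len(shape) < 3:
        # Give scalar-per-transition arrays (e.g. rewards) a feature axis.
        shape = (*shape, 1)
    return arr.swapaxes(0, 1).reshape(shape[0] * shape[1], *shape[2:])


def iter_minibatches(obs: np.ndarray, batch_size: int, rng: np.random.Generator):
    """Hypothetical helper: yield shuffled minibatches of single transitions."""
    flat = swap_and_flatten(obs)
    # One permutation over ALL transitions, not over n_envs-sized chunks.
    indices = rng.permutation(flat.shape[0])
    for start in range(0, flat.shape[0], batch_size):
        yield flat[indices[start:start + batch_size]]


# Toy rollout: 8 steps collected from 4 environments, 3-dimensional observations.
n_steps, n_envs, obs_dim = 8, 4, 3
obs = np.arange(n_steps * n_envs * obs_dim, dtype=np.float32).reshape(
    n_steps, n_envs, obs_dim
)

rng = np.random.default_rng(0)
batches = list(iter_minibatches(obs, batch_size=16, rng=rng))
for batch in batches:
    print(batch.shape)
```

Under these assumptions, each minibatch holds batch_size individual transitions (here 16 rows of 3 values), not n_envs * batch_size, because the environment axis is folded into the sample axis before any indices are drawn.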