[Question] [Multiprocessing] RolloutBuffer groups environment transitions on a per-environment basis.

❓ Question

I would like to clarify something about SB3's source code that differs from my own understanding. (Hence, despite the formatting, this is not a bug report.)

Observation

In the source code, RolloutBuffer (and ReplayBuffer, for that matter) appears to store transitions in n_envs-sized chunks.

The samples are then retrieved with these chunks untouched.

This results in an effective minibatch size of n_envs * batch_size transitions.
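To make the observation concrete, here is a minimal NumPy sketch of the indexing pattern described above. It is not SB3's code: the `(buffer_size, n_envs, obs_dim)` layout, the variable names, and the sizes are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
buffer_size, n_envs, obs_dim, batch_size = 8, 4, 3, 2

# Assumed layout: one entry per step, each holding one transition per environment.
observations = np.zeros((buffer_size, n_envs, obs_dim), dtype=np.float32)

rng = np.random.default_rng(0)

# Drawing batch_size indices along the step axis only keeps
# every n_envs-sized chunk intact.
step_indices = rng.choice(buffer_size, size=batch_size, replace=False)
chunked_batch = observations[step_indices]  # shape: (batch_size, n_envs, obs_dim)

# Number of individual transitions contained in this "minibatch".
n_transitions = chunked_batch.shape[0] * chunked_batch.shape[1]
print(chunked_batch.shape, n_transitions)
```

Under these assumptions, a minibatch requested with batch_size = 2 contains n_envs * batch_size = 8 transitions.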

Expectation

Unlike the documentation for the n_steps argument, the documentation for batch_size does not mention this behavior.

I therefore expected the minibatch size to remain batch_size.

Question

What are the developers' considerations behind grouping the transitions together like this?
How would this differ from, say, shuffling the transitions individually?
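For comparison, this is roughly what I mean by shuffling the transitions individually. It is a sketch only, using the same assumed layout and hypothetical sizes as in my observation, not anything taken from SB3: the step and environment axes are flattened into one axis before a single permutation is drawn.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
buffer_size, n_envs, batch_size = 8, 4, 2

# Tag each transition with its (step, env) origin so samples can be traced back.
steps, envs = np.meshgrid(np.arange(buffer_size), np.arange(n_envs), indexing="ij")
transitions = np.stack([steps, envs], axis=-1)  # shape: (buffer_size, n_envs, 2)

# Merge the step and environment axes so every transition gets its own index.
flat = transitions.reshape(buffer_size * n_envs, -1)

rng = np.random.default_rng(0)
permutation = rng.permutation(buffer_size * n_envs)

# Slice the shuffled indices into minibatches of exactly batch_size transitions.
minibatches = [
    flat[permutation[start:start + batch_size]]
    for start in range(0, len(permutation), batch_size)
]
print(len(minibatches), minibatches[0].shape)
```

In this variant each minibatch holds exactly batch_size transitions, and transitions from the same step are no longer forced into the same minibatch.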

Checklist

[X] I have checked that there is no similar issue in the repo
[X] I have read the documentation
[X] If code there is, it is minimal and working
[X] If code there is, it is formatted using the markdown code blocks for both code and stack traces.
