NM512 / dreamerv3-torch

Implementation of Dreamer v3 in pytorch.
MIT License

[Question] How does episode sampling handle environment resets? #37

Closed · ahan98 closed this issue 1 year ago

ahan98 commented 1 year ago

Hi, I'm confused about how episodes are sampled from the replay buffer, since episodes can have different lengths, and different episodes may come from different environments because the environment resets after a terminal state.

I still don't fully understand the sampling procedure, but from what I can tell from sample_episodes(), episodes that end prematurely are padded with transitions from other episodes until batch_size sequences of length batch_length have been assembled.

For example, suppose batch_size=1 and batch_length=10, and the first episode you sample has only 3 transitions, e.g., (s_1, s_2), (s_2, s_3), (s_3, s_4). After the agent reaches the terminal state s_4, the environment resets, and you obtain another episode of length 10, say, s'_1, ..., s'_10. Could we then train on a sequence such as (s_1, s_2), ..., (s_3, s_4), (s'_1, s'_2), ..., (s'_7, s'_8), i.e., 3 transitions from the first episode plus 7 from the second, for 10 in total? That is, is it okay to combine sequences from different episodes, even though the episodes may have been played in completely different environments?
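
To make my mental model concrete, here is a rough sketch of the chaining I think is happening. All names here (`sample_sequence`, the episode dict keys, the `rng` argument) are hypothetical and not taken from the actual code:

```python
import numpy as np

def sample_sequence(episodes, batch_length, rng):
    # Hypothetical sketch of my understanding, not the actual
    # sample_episodes(): chain slices of randomly chosen episodes
    # until a fixed-length training sequence is filled.
    # Each episode is a dict of equal-length arrays, e.g. with keys
    # "obs", "action", "is_first".
    chunks, remaining = [], batch_length
    while remaining > 0:
        ep = episodes[rng.integers(len(episodes))]
        ep_len = len(ep["is_first"])
        start = rng.integers(ep_len)           # random start offset
        take = min(remaining, ep_len - start)  # take what still fits
        chunks.append({k: v[start:start + take] for k, v in ep.items()})
        remaining -= take
    # Concatenate slices from (possibly) different episodes.
    return {k: np.concatenate([c[k] for c in chunks]) for k in chunks[0]}
```

If something like this is what happens, then a single training sequence can straddle an episode boundary, which is exactly the case I'm asking about.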

Thanks for your time and for an amazing port of Dreamer!

NM512 commented 1 year ago

Hi,

Thank you for your inquiry about the sampling procedure from the replay buffer.

Your concern about combining sequences from different episodes is valid, but I'd like to assure you that it's handled by my current method. By using the "is_first" flag, we reset the hidden_state in the world model during training, allowing sequences from different episodes to be combined without issue. Please refer here.
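
As a rough illustration (hypothetical names, not the exact code in this repo), the reset amounts to masking the recurrent state wherever "is_first" is set:

```python
import torch

def mask_reset(state, is_first, initial_state):
    # Hypothetical illustration: wherever a time step is flagged as
    # the first step of a new episode, replace the carried-over
    # recurrent state with a fresh initial state, so information
    # cannot leak across episode boundaries.
    # state, initial_state: (batch, dim); is_first: (batch,) of 0/1.
    mask = is_first.float().unsqueeze(-1)        # (batch, 1)
    return state * (1.0 - mask) + initial_state * mask
```

Applying this at every step of the sequence before the world model update means the step right after a boundary behaves just like the start of a fresh episode.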

So yes, it's okay to combine sequences as you described, and the model recognizes the boundaries between episodes, even if played in different environments.

I hope this answers your question. Feel free to reach out if you need further clarification.

Best, NM512