Hi @LYK-love, as you said, the algorithm you posted refers to the Dreamer-V1 one, which, by the way, is very similar to the one you shared here. The code you are referring to is the Dreamer-V3 one, which is quite different from the V1 version, especially given the insights we have gained from looking at the authors' code.
As pointed out in #218, we have run experiments and we are matching the results of the Dreamer-V3 paper.
In #223 we are considering adopting the more general and widely accepted `replay_ratio` instead of Hafner's `train_ratio` (which is related), to better match the Dreamer-V3 implementation.
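For reference, a replay ratio is usually defined as the number of gradient updates performed per policy step collected in the environment. Here is a minimal sketch of that relationship; the numbers and variable names below are purely hypothetical and are not the sheeprl configuration:

```python
# Hypothetical numbers, only to illustrate what a replay ratio measures.
gradient_updates_per_training = 1  # gradient steps performed each time we train
env_steps_between_trainings = 2    # policy steps collected between trainings

replay_ratio = gradient_updates_per_training / env_steps_between_trainings
print(replay_ratio)  # 0.5 -> one gradient update every two environment steps
```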
For your last question: please refer first to #223, in which we have removed `per_rank_gradient_steps` and `updates_before_learning` and replaced them with the `replay_ratio`; also, have you tried looking at the how-to where we explain `policy_steps` and everything related to it?
Hi @LYK-love,
as @belerico said, we are moving from our way of computing the replay ratio (with the `train_every` and `per_rank_gradient_steps` parameters) to a more standard way.
In any case, I would like to answer your questions:
- `update_steps` in the pseudo-code you provided is equivalent to our `per_rank_gradient_steps`. The sample function is called once, but we set the parameter `n_samples = cfg.algo.per_rank_gradient_steps`. The sample function returns an object with dimensions `(per_rank_gradient_steps, sequence_length, batch_size, ...)`, where `...` stands for the dimensions of the observations/actions/rewards/dones. After that, a "for cycle" is performed over the first dimension of this object, so you perform `cfg.algo.per_rank_gradient_steps` updates (see the sketch below).
- The `train_step` variable counts how many times the `train` function is called (globally, if you are training with more than one GPU) and it is used only for logging.
- `num_updates` is the total number of iterations that a single process has to perform (it is the outer "while not converged" loop in the pseudo-code). It is computed as `cfg.algo.total_steps // (num_envs * world_size)`, so it takes into account the number of GPUs and the number of environments.
- `updates_before_training` is the variable that specifies how many policy steps to play between one training and the next: it is the `T` of the DreamerV1 pseudo-code. Again, it takes into account the number of environments and the number of GPUs you are using for training.
- `if update >= learning_starts and updates_before_training <= 0` checks whether we have to train or not (`learning_starts` is the number of steps to perform before training starts).

Perhaps what has led you astray is that in this implementation we do not have an outer loop with two loops inside (one for environment interaction and the other for training); instead, we have a loop for environment interaction and, inside it, we check whether we need to carry out training. This choice was made to follow the original repository as closely as possible.
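To make the sampling pattern concrete, here is a minimal sketch of what is described above; the buffer stand-in, its shapes, and all names below are hypothetical and are not the actual sheeprl API:

```python
import numpy as np

# Hypothetical stand-in for the replay buffer: sample() returns an array shaped
# (n_samples, sequence_length, batch_size, feature_dim).
def sample(n_samples, sequence_length=64, batch_size=16, feature_dim=8):
    return np.random.randn(n_samples, sequence_length, batch_size, feature_dim)

per_rank_gradient_steps = 4
data = sample(n_samples=per_rank_gradient_steps)

# The "for cycle" over the first dimension: one gradient update per slice.
for i in range(per_rank_gradient_steps):
    batch = data[i]  # shape: (sequence_length, batch_size, feature_dim)
    # train(batch)   # one call to the train function per gradient step
```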
So the structure of our code is (let me change the names of the variables for better understanding):

```python
# counter of the policy steps played: when it reaches zero, the agent is trained
env_interaction_steps_between_trainings = train_every

for i in range(total_steps):
    env.step(action)
    env_interaction_steps_between_trainings -= 1
    if train_started and env_interaction_steps_between_trainings <= 0:
        for j in range(per_rank_gradient_steps):
            train()
        # reset the counter of policy steps to play before the next training
        env_interaction_steps_between_trainings = train_every
```
Another thing I noticed is that `num_updates` and `updates_before_training` are imprecise names; we will fix them for clarity.
Let me know if it is clearer now.
Thanks
Thanks! I'll do more work on sheeprl and try to reproduce the results of the original paper.
@michele-milesi So for each iteration of `for i in range(total_steps):`, only one step of env interaction is performed, right? While in the pseudocode, `T` steps are performed in one iteration of the outermost while loop. I think `T` is the episode length.
Yeah, at each iteration of the outer for-loop, one step of env interaction is performed.
You can obtain the same behaviour as the pseudocode by properly setting the `env_interaction_steps_between_trainings` variable (our `train_every`). If it is set to `10`, it means that you perform 10 steps of env interaction between one training and the next.
Remember that Dreamer aims to be sample efficient (the fewer steps of env interaction you perform between one training session and the next, the better).
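As a concrete illustration of the schedule, here is a small sketch; the numbers below are hypothetical, not defaults:

```python
# Hypothetical numbers, only to make the training schedule concrete.
total_env_steps = 100_000
train_every = 10             # env steps between one training session and the next
per_rank_gradient_steps = 5  # gradient updates performed in each training session

training_sessions = total_env_steps // train_every                     # 10_000
total_gradient_updates = training_sessions * per_rank_gradient_steps   # 50_000
print(training_sessions, total_gradient_updates)
```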
Hello, I found the following code in `sheeprl/algos/dreamer_v3.py`:

Based on my understanding, the training procedure of DreamerV3 is the same as DreamerV1's, as shown in DreamerV1's paper:
Basically, we need to:

- run a `while not converged` loop; in each of its iterations we:
  - run a `for i in range(update_steps)` loop; in each of its iterations we perform dynamics learning and behavior learning.

So, I think sheeprl's `train()` function is the same as: one time of dynamics learning + one time of behavior learning. It should be called `update_steps` times in a for loop, and that for loop should be called multiple times until the agent converges. However, in the code I provided at the beginning, I didn't see the `train()` function being called `update_steps` times in a for loop, and I didn't see the outermost while loop. Meanwhile, I didn't find that, after each for loop, an episode is collected and added to the replay buffer. I think sheeprl's implementation is a little different from the paper's (the loop structure I mean is sketched below). Can you explain it?
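Here is a minimal, hypothetical sketch of the loop structure I mean; stub functions stand in for the actual world-model/actor-critic updates and the environment interaction, and this is not the sheeprl code:

```python
# Hypothetical stubs, only to show the nesting of the loops.
def dynamics_learning():
    ...  # update the world model on sampled sequences

def behavior_learning():
    ...  # update the actor and critic on imagined trajectories

def collect_episode(episode_length):
    ...  # interact with the env for `episode_length` steps and store the data

def training_loop(num_iterations, update_steps, episode_length):
    # outer "while not converged" loop, approximated with a fixed iteration count
    for _ in range(num_iterations):
        for _ in range(update_steps):
            dynamics_learning()
            behavior_learning()
        # environment interaction after the inner update loop
        collect_episode(episode_length)
```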
What's more, can you explain what `train_step`, `per_rank_gradient_steps`, `if update >= learning_starts and updates_before_training <= 0`, `updates_before_training`, and `num_updates` are? I also can't understand the logic of this piece of code.

Thanks!