Just making sure I'm understanding right:

replay ratio = train ratio / (batch size × seq len)

Thus, to achieve a particular replay ratio, you set `per_rank_batch_size` and `per_rank_sequence_length` to the corresponding values listed above (in the Dreamer v3 paper, those would be 16 and 64 respectively), then choose `train_every` and `per_rank_gradient_steps` such that you have:

`train_every * replay_ratio = per_rank_gradient_steps`

Is that right? Just wanted to clarify, since `train_every` and `per_rank_gradient_steps` do not appear in the above example code.
Hi @verityw,

we set the `per_rank_batch_size` and `per_rank_sequence_length` parameters as suggested in the paper. Here we are proposing to replace our `train_every` and `per_rank_gradient_steps` with the `replay_ratio` parameter.

As it is now implemented, to achieve the correct replay ratio, `train_every` and `per_rank_gradient_steps` must be set taking into account both the number of environments and the replay ratio you want to use for training. This is inconvenient and requires doing some calculations before setting these two parameters.
For example, suppose we want to set the replay ratio to 0.5 (a quick sanity check of these three configurations is sketched below):

- `train_every=1` and `per_rank_gradient_steps=2`: at each iteration you perform 4 policy steps (e.g. with 4 environments), so 2 gradient_steps / 4 policy_steps = 0.5.
- `train_every=1` and `per_rank_gradient_steps=1`: 1 gradient_step / 2 policy_steps = 0.5.
- `train_every=2` and `per_rank_gradient_steps=1`: 2 iterations (one policy step each, for a total of 2 policy steps) and 1 gradient step, so 1 gradient_step / 2 policy_steps = 0.5.

With the `replay_ratio` parameter, these calculations are made automatically.
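Here is a minimal sanity check of those three configurations; note that the `num_envs` values (4, 2, 1) are my inference from the policy-step counts in the bullets, not explicit settings from the examples:

```python
# Quick check of the three example configurations above.
# NOTE: the num_envs values (4, 2, 1) are inferred from the policy-step
# counts in the bullets; they are assumptions for this sketch.
configs = [
    {"train_every": 1, "per_rank_gradient_steps": 2, "num_envs": 4},
    {"train_every": 1, "per_rank_gradient_steps": 1, "num_envs": 2},
    {"train_every": 2, "per_rank_gradient_steps": 1, "num_envs": 1},
]
for cfg in configs:
    # Policy steps collected between two consecutive training phases
    policy_steps = cfg["train_every"] * cfg["num_envs"]
    ratio = cfg["per_rank_gradient_steps"] / policy_steps
    print(cfg, "-> replay ratio", ratio)  # 0.5 in every case
```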
Yep, `train_every` and `per_rank_gradient_steps` do not appear because they can both be represented by the `per_rank_repeats` computed like this:
```python
class Ratio:
    """Directly taken from Hafner et al. (2023) implementation:
    https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/embodied/core/when.py#L26
    """

    def __init__(self, ratio):
        assert ratio >= 0, ratio
        self._ratio = ratio
        self._prev = None

    def __call__(self, step):
        step = int(step)
        if self._ratio == 0:
            return 0
        if self._prev is None:
            # First call: remember the starting step and train once
            self._prev = step
            return 1
        # Gradient repeats owed for the policy steps elapsed since the last accounted step
        repeats = round((step - self._prev) * self._ratio)
        self._prev += repeats / self._ratio
        return repeats


if __name__ == "__main__":
    num_envs = 1
    world_size = 1
    replay_ratio = 0.5
    per_rank_batch_size = 16
    per_rank_sequence_length = 64
    replayed_steps = world_size * per_rank_batch_size * per_rank_sequence_length
    train_steps = 0
    gradient_steps = 0
    total_policy_steps = 2**5
    r = Ratio(ratio=replay_ratio)
    policy_steps = num_envs * world_size
    for i in range(0, total_policy_steps, policy_steps):
        # Ask the Ratio how many gradient repeats each rank should run at this policy step
        per_rank_repeats = r(i / world_size)
        if per_rank_repeats > 0:
            print(
                f"Training the agent with {per_rank_repeats} repeats on every rank "
                f"({per_rank_repeats * world_size} global repeats) at global iteration {i}"
            )
            gradient_steps += per_rank_repeats * world_size
    print("Replay ratio", replay_ratio)
    print("Hafner train ratio", replay_ratio * replayed_steps)
    print("Final ratio", gradient_steps / total_policy_steps)
```
which prints:

```
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 0
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 2
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 4
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 6
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 8
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 10
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 12
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 14
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 16
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 18
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 20
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 22
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 24
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 26
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 28
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 30
Replay ratio 0.5
Hafner train ratio 512.0
Final ratio 0.5
```
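As a further, purely illustrative check, reusing the `Ratio` class defined above with more than one environment still converges to the requested ratio without any manual calculation (the numbers below are assumptions for the sake of the example):

```python
# Same loop as above, but with 4 environments (illustrative values only);
# reuses the Ratio class defined in the previous snippet.
num_envs = 4
world_size = 1
replay_ratio = 0.5
total_policy_steps = 2**10
r = Ratio(ratio=replay_ratio)
policy_steps = num_envs * world_size
gradient_steps = 0
for i in range(0, total_policy_steps, policy_steps):
    per_rank_repeats = r(i / world_size)
    gradient_steps += per_rank_repeats * world_size
print("Final ratio", gradient_steps / total_policy_steps)  # ~0.499, i.e. ~0.5
```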
Regarding your other question:

> Just making sure I'm understanding right: replay ratio = train ratio / (batch size × seq len). Thus, to achieve a particular replay ratio, you set `per_rank_batch_size` and `per_rank_sequence_length` to the corresponding values listed above (in the Dreamer v3 paper, those would be 16 and 64 respectively), then choose `train_every` and `per_rank_gradient_steps` such that you have: `train_every * replay_ratio = per_rank_gradient_steps`
Exactly, but we also need to consider the world size (how many different processes are running, coming from `fabric.devices=N`) and the number of environments per process (coming from `env.num_envs=M`), so the best way is to fix `train_every=N*M` and compute the `per_rank_gradient_steps` accordingly (having fixed `per_rank_batch_size=16` and `per_rank_sequence_length=64` if you want to maintain the same replay ratio as Hafner). A small numeric example is sketched below.
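For instance, a hypothetical calculation (the `N`, `M` and resulting values are only illustrative, and it uses the same counting convention as the script above, where every rank's gradient step contributes to the global total):

```python
# Hypothetical numbers: 2 processes (fabric.devices=N) with 4 envs each (env.num_envs=M)
N, M = 2, 4
replay_ratio = 0.5                      # desired global replay ratio
train_every = N * M                     # fix train_every to the policy steps per iteration
per_rank_gradient_steps = replay_ratio * train_every / N  # -> 2.0 gradient steps per rank
# Check: global gradient steps / global policy steps between trainings
assert (per_rank_gradient_steps * N) / train_every == replay_ratio
```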
We hope that the single `replay_ratio` parameter will simplify experimentation from the user perspective, since 1) we are aligned with the literature and 2) we encapsulate two parameters in one.
In the Dreamer-V3 paper, Hafner defines the `train_ratio` as

`train_ratio = replayed_steps * gradient_steps / policy_steps`

where `replayed_steps = batch_size * seq_len = 16 * 64 = 1024` by default.

In the literature, the accepted quantity that expresses the number of agent updates per environment interaction (or policy step) is called the `replay_ratio`. Moreover, Hafner, in his official code, computes exactly the `replay_ratio` as `replay_ratio = train_ratio / replayed_steps`, which in the case of DMC vision environments is equal to `replay_ratio = 512 / (16 * 64) = 512 / 1024 = 1 / 2`, i.e. 1 gradient step every 2 policy steps.

I propose to combine the `algo.train_every` and `algo.per_rank_gradient_steps` of every off-policy method into the single parameter `algo.replay_ratio`, which represents the global replay ratio, unaware of the existence of multiple processes in a distributed setting and of multiple copies of the environment in a single process; for those reasons we must account for that in our code (see the `Ratio` snippet and its output above).
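For reference, going from Hafner's `train_ratio` to the proposed `algo.replay_ratio` is just the division above; a short sketch with the DMC vision defaults quoted in this issue:

```python
# DMC vision defaults quoted above (Dreamer-V3)
train_ratio = 512
per_rank_batch_size = 16
per_rank_sequence_length = 64
replayed_steps = per_rank_batch_size * per_rank_sequence_length  # 1024
replay_ratio = train_ratio / replayed_steps
print(replay_ratio)  # 0.5 -> 1 gradient step every 2 policy steps
```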
cc @michele-milesi