Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

About Hafner `train_ratio` and general `replay_ratio` #223

Closed · belerico closed this issue 3 months ago

belerico commented 4 months ago

In the Dreamer-V3 paper, Hafner defines the train_ratio as

Table A.1: Benchmark overview. The train ratio is the number of replayed steps per policy steps rather than environment steps, and thus unaware of the action repeat.

where replayed_steps = batch_size * seq_len = 16 * 64 = 1024 by default.

In the literature, the accepted quantity that expresses the number of agent updates per environment interaction (or policy step) is called the replay_ratio.

Moreover, in his official code, Hafner computes exactly the replay_ratio as replay_ratio = train_ratio / replayed_steps, which for the DMC vision environments equals replay_ratio = 512 / (16 * 64) = 512 / 1024 = 1/2, i.e. 1 gradient step every 2 policy steps.

I propose to replace the algo.train_every and algo.per_rank_gradient_steps of every off-policy method with the single parameter algo.replay_ratio, which represents the global replay ratio. This ratio is unaware of the existence of multiple processes in a distributed setting and of multiple copies of the environment within a single process, so we must account for both in our code:

class Ratio:
    """Directly taken from Hafner et al. (2023) implementation:
    https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/embodied/core/when.py#L26
    """

    def __init__(self, ratio):
        assert ratio >= 0, ratio
        self._ratio = ratio
        self._prev = None

    def __call__(self, step):
        step = int(step)
        if self._ratio == 0:
            return 0
        if self._prev is None:
            # First call: train once and initialize the step counter
            self._prev = step
            return 1
        # Gradient repeats owed, given the policy steps elapsed since the last call
        repeats = round((step - self._prev) * self._ratio)
        # Advance the counter by the policy steps those repeats account for
        self._prev += repeats / self._ratio
        return repeats

if __name__ == "__main__":
    num_envs = 4
    world_size = 1
    replay_ratio = 0.5
    per_rank_batch_size = 16
    per_rank_sequence_length = 64
    replayed_steps = world_size * per_rank_batch_size * per_rank_sequence_length
    gradient_steps = 0
    total_policy_steps = 2**16
    r = Ratio(ratio=replay_ratio)
    # Policy steps gathered at every collection iteration (all ranks, all environments)
    policy_steps = num_envs * world_size
    for i in range(0, total_policy_steps, policy_steps):
        # Each rank has only seen i / world_size policy steps
        per_rank_repeats = r(i / world_size)
        gradient_steps += per_rank_repeats * world_size
    print("Replay ratio", replay_ratio)
    print("Hafner train ratio", replay_ratio * replayed_steps)
    print("Final ratio", gradient_steps / total_policy_steps)

which prints

Replay ratio 0.5
Hafner train ratio 512.0
Final ratio 0.4999847412109375
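
As a sanity check of the distributed bookkeeping, running the same loop with world_size=2 and num_envs=4 (illustrative values, reusing the Ratio class above) preserves the same global ratio:

num_envs = 4
world_size = 2
replay_ratio = 0.5
total_policy_steps = 2**16
gradient_steps = 0
r = Ratio(ratio=replay_ratio)
# Every collection iteration gathers num_envs * world_size policy steps globally
policy_steps = num_envs * world_size
for i in range(0, total_policy_steps, policy_steps):
    # Each rank has only seen i / world_size policy steps
    gradient_steps += r(i / world_size) * world_size
print("Final ratio", gradient_steps / total_policy_steps)  # roughly 0.5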

cc @michele-milesi

verityw commented 3 months ago

Just making sure I'm understanding right: replay_ratio = train_ratio / (batch_size * seq_len). Thus, to achieve a particular replay ratio, you set per_rank_batch_size and per_rank_sequence_length to the corresponding values listed above (in the Dreamer v3 paper, those would be 16 and 64 respectively), then choose train_every and per_rank_gradient_steps such that you have `train_every * replay_ratio = per_rank_gradient_steps`.

Is that right? Just wanted to clarify, since train_every and per_rank_gradient_steps do not appear in the above example code.

michele-milesi commented 3 months ago

Hi @verityw, we set the per_rank_batch_size and per_rank_sequence_length parameters as suggested in the paper. Here we are proposing to replace our train_every and per_rank_gradient_steps with the replay_ratio parameter. As it is now implemented, to achieve the correct replay ratio, train_every and per_rank_gradient_steps must be set taking into account both the number of environments and the replay ratio you want to use for training. This is inconvenient and requires doing some calculations before setting these two parameters (train_every and per_rank_gradient_steps).

For example, suppose we want to set the replay ratio to 0.5:
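
Assuming for illustration a single process (fabric.devices=1) with env.num_envs=4, every collection iteration gathers 4 policy steps, so one would have to set train_every=4 and per_rank_gradient_steps=2 (since 4 * 0.5 = 2); change the number of environments or devices and both values must be recomputed by hand.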

With the replay_ratio parameter, these calculations are made automatically.
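
(Concretely, the idea is that a single override like algo.replay_ratio=0.5 would take the place of the algo.train_every / algo.per_rank_gradient_steps pair, whatever the number of devices and environments.)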

belerico commented 3 months ago

Yep, train_every and per_rank_gradient_steps do not appear because they can both be represented by the per_rank_repeats, like this:

class Ratio:
    """Directly taken from Hafner et al. (2023) implementation:
    https://github.com/danijar/dreamerv3/blob/8fa35f83eee1ce7e10f3dee0b766587d0a713a60/dreamerv3/embodied/core/when.py#L26
    """

    def __init__(self, ratio):
        assert ratio >= 0, ratio
        self._ratio = ratio
        self._prev = None

    def __call__(self, step):
        step = int(step)
        if self._ratio == 0:
            return 0
        if self._prev is None:
            self._prev = step
            return 1
        repeats = round((step - self._prev) * self._ratio)
        self._prev += repeats / self._ratio
        return repeats

if __name__ == "__main__":
    num_envs = 1
    world_size = 1
    replay_ratio = 0.5
    per_rank_batch_size = 16
    per_rank_sequence_length = 64
    replayed_steps = world_size * per_rank_batch_size * per_rank_sequence_length
    gradient_steps = 0
    total_policy_steps = 2**5
    r = Ratio(ratio=replay_ratio)
    policy_steps = num_envs * world_size
    for i in range(0, total_policy_steps, policy_steps):
        per_rank_repeats = r(i / world_size)
        if per_rank_repeats > 0:
            print(
                f"Training the agent with {per_rank_repeats} repeats on every rank "
                f"({per_rank_repeats * world_size} global repeats) at global iteration {i}"
            )
        gradient_steps += per_rank_repeats * world_size
    print("Replay ratio", replay_ratio)
    print("Hafner train ratio", replay_ratio * replayed_steps)
    print("Final ratio", gradient_steps / total_policy_steps)

which prints

Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 0
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 2
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 4
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 6
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 8
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 10
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 12
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 14
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 16
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 18
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 20
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 22
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 24
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 26
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 28
Training the agent with 1 repeats on every rank (1 global repeats) at global iteration 30
Replay ratio 0.5
Hafner train ratio 512.0
Final ratio 0.5

Regarding your other question:

Just making sure I'm understanding right: replay_ratio = train_ratio / (batch_size * seq_len). Thus, to achieve a particular replay ratio, you set per_rank_batch_size and per_rank_sequence_length to the corresponding values listed above (in the Dreamer v3 paper, those would be 16 and 64 respectively), then choose train_every and per_rank_gradient_steps such that you have `train_every * replay_ratio = per_rank_gradient_steps`.

Exactly, but we also need to consider the world size (how many processes are running, coming from fabric.devices=N) and the number of environments per process (coming from env.num_envs=M). So the best way is to fix train_every=N*M and compute per_rank_gradient_steps accordingly (having fixed per_rank_batch_size=16 and per_rank_sequence_length=64 if you want to maintain the same replay ratio as Hafner).
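
To make that concrete, here is a small sketch of the manual bookkeeping that the proposed algo.replay_ratio would replace (the values of N and M are just for illustration):

world_size = 2        # N, from fabric.devices=N (illustrative value)
num_envs = 4          # M, from env.num_envs=M (illustrative value)
replay_ratio = 0.5    # desired global gradient steps per policy step

# Train at every collection iteration, i.e. every N * M policy steps ...
train_every = world_size * num_envs
# ... so the per-rank gradient steps follow from the desired global ratio:
per_rank_gradient_steps = int(replay_ratio * train_every / world_size)  # = replay_ratio * num_envs = 2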

We hope that the single replay_ratio parameter will simplify experimentation from the user's perspective, since 1) we are aligned with the literature and 2) we encapsulate two parameters in one.