DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Feature Request] Allow users to define gradient steps as a fraction of rollout time-steps #1920

Closed: janakact closed this issue 1 week ago

janakact commented 2 months ago

🚀 Feature

Currently, SB3 algorithms allow you to set the number of gradient steps to $-1$, which translates into the number of timesteps in the rollout; let's call it $k$.

However, allowing the user to define gradient_steps $=-2, -4$ or $-8$, which would translate into $k/2$, $k/4$, or $k/8$ gradient steps respectively, would be more beneficial. This would also allow users to write simpler hyperparameter tuning code.

Implementation is pretty simple: change off_policy_algorithm.py L344:

gradient_steps = self.gradient_steps if self.gradient_steps >= 0 else rollout.episode_timesteps
# should be changed to (integer division, so gradient_steps stays an int and is at least 1):
gradient_steps = self.gradient_steps if self.gradient_steps >= 0 else max(rollout.episode_timesteps // -self.gradient_steps, 1)
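
For illustration, a minimal sketch of the proposed mapping as a standalone function (resolve_gradient_steps is a hypothetical helper, not part of SB3):

def resolve_gradient_steps(gradient_steps: int, rollout_timesteps: int) -> int:
    # Non-negative values are taken literally, as SB3 does today.
    if gradient_steps >= 0:
        return gradient_steps
    # Proposed: -n means rollout_timesteps / n gradient steps, with a floor of 1.
    return max(rollout_timesteps // -gradient_steps, 1)

# With a 64-step rollout: -1 -> 64, -2 -> 32, -4 -> 16, -8 -> 8
assert resolve_gradient_steps(-1, 64) == 64
assert resolve_gradient_steps(-4, 64) == 16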

Motivation

Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency. (See rl-zoo code)

Pitch

This wouldn't break existing functionality, and the code change is pretty simple.

Alternatives

No response

Additional context

No response

Checklist

araffin commented 2 months ago

Hello, please don't forget about alternatives too (you ticked the checklist box).

janakact commented 1 month ago

@araffin Sorry, I couldn't think of any alternatives. Apologies for the mistake; I've unticked the checkbox.

araffin commented 1 month ago

Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency. (See rl-zoo code)

The code in the RL Zoo is only one line, so I'm not sure how much this would simplify things.

However, allowing the user to define gradient_steps $=-2, -4$ or $-8$, which would translate into $k/2$, $k/4$, or $k/8$ gradient steps respectively

I have to say that using $-2$ to mean $k/2$ seems rather unintuitive ($-1$ is a special case, as it is intentionally an invalid value).

Alternatives

As you pointed out:

# total environment steps collected per rollout
n_steps = train_freq * n_envs
# do one gradient step for every `sub_factor` environment steps
sub_factor = 2
gradient_steps = max(n_steps // sub_factor, 1)

seems a pretty simple and flexible alternative
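
For concreteness, a minimal sketch of how this plugs into an SB3 off-policy algorithm (the env id and values here are placeholders, and this assumes a recent SB3 version where SAC supports vectorized environments):

from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

n_envs = 4
train_freq = 64  # env.step() calls per rollout; each call collects n_envs transitions
sub_factor = 2   # one gradient step per two collected transitions

n_steps = train_freq * n_envs
gradient_steps = max(n_steps // sub_factor, 1)

env = make_vec_env("Pendulum-v1", n_envs=n_envs)
model = SAC("MlpPolicy", env, train_freq=train_freq, gradient_steps=gradient_steps)
model.learn(total_timesteps=10_000)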

swierh commented 2 weeks ago

An alternative way to solve this would be similar to scikit-learn's train_test_split, which dispatches on the argument's data type: a float is interpreted as a fraction and an int as an absolute count.

Though for sklearn they don't have the issue of 2.0 * train_freq and 2 gradient steps both being valid options (a test_size fraction must lie between 0 and 1).
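
A minimal sketch of that idea, assuming a float is read as a fraction (or multiple) of the rollout timesteps and an int as an absolute count (interpret_gradient_steps is a hypothetical helper, not SB3 API):

def interpret_gradient_steps(gradient_steps, rollout_timesteps):
    # float => fraction (or multiple) of the rollout timesteps, train_test_split style
    if isinstance(gradient_steps, float):
        return max(int(gradient_steps * rollout_timesteps), 1)
    # int => absolute number of gradient steps, as SB3 interprets it today
    return gradient_steps

assert interpret_gradient_steps(0.5, 64) == 32  # half the rollout
assert interpret_gradient_steps(2, 64) == 2     # exactly two gradient steps

The ambiguity noted above is visible here: gradient_steps=2.0 would mean 128 gradient steps for a 64-step rollout, while gradient_steps=2 means exactly two.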