hello, please don't forget about alternatives too (you ticked the box).
@araffin sorry, I couldn't think of alternatives. Apologies for the mistake. I unticked the checkbox.
> Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency. (See rl-zoo code)
The code in the RL Zoo is only one line, so I'm not sure how much it would simplify things.
> However, allowing the user to define gradient_steps $=-2, -4$ or $-8$, which translates into $k/2$, $k/4$, $k/8$ gradient steps

I have to say using `-2` to mean `/2` seems rather unintuitive (`-1` is a special case, as it is intentionally an invalid value).
Alternatives
As you pointed out:
```python
# Total environment steps collected per rollout
n_steps = train_freq * n_envs
# One gradient step for every `sub_factor` environment steps, at least one
sub_factor = 2
gradient_steps = max(n_steps // sub_factor, 1)
```
seems like a pretty simple and flexible alternative.
An alternative way to solve this would be something similar to scikit-learn's train_test_split, which interprets its test_size argument based on the data type (a float as a fraction, an int as an absolute count). Though for sklearn, they don't have the issue of 2.0 * train_freq and 2 gradient steps both being valid options.
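A minimal sketch of what that type-based dispatch could look like for gradient_steps (the helper name resolve_gradient_steps is hypothetical, not part of SB3):

```python
def resolve_gradient_steps(gradient_steps, rollout_timesteps):
    """Interpret `gradient_steps` based on its type (hypothetical helper).

    A float is treated as a fraction (or multiple) of the rollout length,
    an int as an absolute number of gradient steps.
    """
    if isinstance(gradient_steps, float):
        # e.g. 0.5 -> half as many gradient steps as rollout timesteps
        return max(int(gradient_steps * rollout_timesteps), 1)
    # int: use the value as-is (the existing behavior)
    return gradient_steps


# 0.5 * 64 -> 32 gradient steps, while the int 2 means exactly 2 steps,
# illustrating the ambiguity noted above (2.0 vs 2 are both valid inputs)
assert resolve_gradient_steps(0.5, 64) == 32
assert resolve_gradient_steps(2, 64) == 2
```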
🚀 Feature
Currently, SB3 algorithms allow you to define the number of gradient steps $= -1$, which translates into the number of timesteps in the rollout, let's call it $k$.
However, allowing the user to define gradient_steps $=-2, -4$ or $-8$, which would translate into $k/2$, $k/4$, $k/8$ gradient steps respectively, would be more beneficial. This would also allow users to write simpler hyperparameter tuning code.
The implementation is pretty simple: change off_policy_algorithm.py L344:
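The snippet below is a sketch of what that change could look like; the "current behavior" line is paraphrased from the gradient_steps resolution in OffPolicyAlgorithm.learn and may differ between SB3 versions:

```python
# Current behavior: -1 means one gradient step per collected timestep
# gradient_steps = self.gradient_steps if self.gradient_steps >= 0 else rollout.episode_timesteps

# Proposed: any negative value -n acts as a divisor of the rollout length k
if self.gradient_steps >= 0:
    gradient_steps = self.gradient_steps
else:
    # -1 keeps its current meaning (k // 1 == k),
    # -2 -> k // 2, -4 -> k // 4, with at least one gradient step
    gradient_steps = max(rollout.episode_timesteps // -self.gradient_steps, 1)
```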
Motivation
Simplify hyperparameter tuning. We always take the number of gradient steps as a fraction of the training frequency. (See rl-zoo code)
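For context, the RL Zoo sampling being referenced looks roughly like the sketch below (paraphrased from memory; the exact parameter names and value ranges may differ):

```python
import optuna


def sample_params(trial: optuna.Trial) -> dict:
    # Paraphrased sketch of RL Zoo-style hyperparameter sampling
    train_freq = trial.suggest_categorical("train_freq", [1, 4, 8, 16, 32, 64, 128, 256, 512])
    subsample_steps = trial.suggest_categorical("subsample_steps", [1, 2, 4, 8])
    # The "one line" in question: gradient steps as a fraction of train_freq.
    # With the proposed feature, this could instead be: gradient_steps = -subsample_steps
    gradient_steps = max(train_freq // subsample_steps, 1)
    return {"train_freq": train_freq, "gradient_steps": gradient_steps}
```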
Pitch
This wouldn't break existing functionality, and the code change is pretty simple.
Alternatives
No response
Additional context
No response