Hello, could you please provide the hyperparameters you used for SB2 PPO?
Related issues (please have a look): https://github.com/DLR-RM/stable-baselines3/issues/90#issuecomment-742525593 and https://github.com/DLR-RM/stable-baselines3/issues/512#issuecomment-881281399
Here are the hyperparameters used for SB2 PPO:
```python
def MlpPolicy(
    name=name,
    ob_space=obs_space,  # same as SB3
    ac_space=ac_space,  # same as SB3
    hid_size=312,
    num_hid_layers=2,
    num_of_categories=3,
)

pposgd_simple.learn(
    env_creator=env,  # same env as above
    workerseed=seed + 10000 * MPI.COMM_WORLD.Get_rank(),  # this was either 4 or 8 threads
    policy_fn=MlpPolicy,
    max_timesteps=50000000,
    timesteps_per_actorbatch=1536,
    clip_param=0.2,
    entcoeff=0.01,
    optim_epochs=4,
    optim_stepsize=0.001,
    optim_batchsize=512,
    gamma=0.99,
    lam=0.95,
    schedule="linear",
    stochastic=True,
)
```
I see, you are using PPO1 (PPO with MPI). I'm not sure how you translated them to SB3 PPO, as some seem quite off (for instance `optim_stepsize=0.001` in SB2 PPO, but you use `learning_rate=0.0005`).
I'm not sure where you got `hid_size=312`, `num_hid_layers=2`, and `num_of_categories=3` from, as these are not parameters of the PPO1 MlpPolicy. Same for `stochastic`...
Your parameters should translate to:
```python
from typing import Callable

from torch import nn

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

hidden_size = 312
policy_kwargs = {
    "log_std_init": 0.0,
    "ortho_init": True,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [hidden_size, hidden_size],
        "vf": [hidden_size, hidden_size],
    },
    # Note: Adam epsilon is 1e-5 by default for SB3 PPO
}

# IMPORTANT: n_envs influences the number of steps collected per update
n_envs = 8
# Create the vectorized env, e.g.:
# envs = make_vec_env(env_id=make_callable_env(), n_envs=n_envs, vec_env_cls=SubprocVecEnv)


# PPO1 has schedule='linear' as a default
def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """
    Linear learning rate schedule.

    :param initial_value: Initial learning rate.
    :return: schedule that computes the current learning rate
        depending on remaining progress
    """

    def func(progress_remaining: float) -> float:
        """
        Progress will decrease from 1 (beginning) to 0.

        :param progress_remaining:
        :return: current learning rate
        """
        return progress_remaining * initial_value

    return func


model = PPO(
    policy="MlpPolicy",
    env=envs,
    learning_rate=linear_schedule(0.001),
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=1,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
    max_grad_norm=100,  # PPO1 doesn't rescale the gradient apparently
)
```
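If you want to make the optimizer settings explicit (they are already the defaults, as noted in the comment above), you can pass them through `policy_kwargs`. A small sketch:

```python
import torch as th

# Optional: spell out the optimizer explicitly. SB3 PPO already uses Adam with
# eps=1e-5 by default, so this does not change behaviour, it only documents it.
policy_kwargs["optimizer_class"] = th.optim.Adam
policy_kwargs["optimizer_kwargs"] = {"eps": 1e-5}
```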
Please note that the number of envs running in parallel is an important hyperparameter (see the notebook in our documentation).
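For example, with the values above (just a sanity check on the sizes involved):

```python
# Each PPO update in SB3 collects n_steps * n_envs transitions,
# which are then split into mini-batches of size batch_size.
n_steps = 1536
n_envs = 8
batch_size = 512

rollout_size = n_steps * n_envs                       # 12288 transitions per update
minibatches_per_epoch = rollout_size // batch_size    # 24 mini-batches per epoch
```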
Thank you for working out the hyperparameters! I will try these out over the weekend, as it takes a day to train.
❓ Question
Hi, I am struggling to get PPO to learn effectively on my environment. The reward earned is not smooth and spikes. This is the reward after 7 million steps.
I am using a custom env with these settings (a simplified sketch of how these flags are returned follows this list):

- the episode ends when `is_done()` returns True
- the episode is cut short when `is_truncated()` returns True
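For context, here is roughly how those flags map onto the `step()` return (placeholder spaces and helpers, not my actual env):

```python
import gymnasium as gym
import numpy as np


class MyCustomEnv(gym.Env):
    """Minimal sketch: is_done()/is_truncated() are stand-ins for my own helpers."""

    def __init__(self):
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        reward = 0.0
        # terminated: the task itself ended (goal reached / failure state)
        terminated = self.is_done()
        # truncated: the episode was cut short (e.g. time limit)
        truncated = self.is_truncated()
        return obs, reward, terminated, truncated, {}

    # Placeholders for my own termination helpers
    def is_done(self) -> bool:
        return False

    def is_truncated(self) -> bool:
        return False
```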
The PPO algorithm is set up with the following parameters:
I have tried to use the Optuna framework (https://optuna.org/) to do some hyperparameter optimization, changing the network architecture size between 64/128/256 as well as trying different values of `n_steps`, `batch_size`, `activation_fn`, ... but I have not found a suitable set. Hyperparameter optimization is also incredibly time-consuming, since I expect the agent to learn well (where the reward is >50% of the agent's episode length) within 1,000,000 steps. Reaching 1,000,000 steps takes hours, and adequate learning takes ~10,000,000 steps, so with my current hardware such a parameter sweep is not feasible. I have used SB2 with the same env and it learned smoothly.
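The search was roughly along these lines (simplified sketch, with `make_env()` standing in for my env factory; not my exact code):

```python
import optuna
from torch import nn

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial: optuna.Trial) -> float:
    # Sample the hyperparameters I was sweeping over
    net_size = trial.suggest_categorical("net_size", [64, 128, 256])
    n_steps = trial.suggest_categorical("n_steps", [512, 1024, 1536, 2048])
    batch_size = trial.suggest_categorical("batch_size", [128, 256, 512])
    activation = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    policy_kwargs = {
        "net_arch": {"pi": [net_size, net_size], "vf": [net_size, net_size]},
        "activation_fn": nn.Tanh if activation == "tanh" else nn.ReLU,
    }
    env = make_env()  # placeholder for my custom env factory
    model = PPO(
        "MlpPolicy",
        env,
        n_steps=n_steps,
        batch_size=batch_size,
        policy_kwargs=policy_kwargs,
        verbose=0,
    )
    model.learn(total_timesteps=1_000_000)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
```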
I have had a look at the SB2 to SB3 migration guide and copied over the old parameters as best I could, but with no success. I also checked out the rl_zoo for inspiration.
I have also checked the tensorboard and nothing seems out of the ordinary.
Is there something that I am missing? Are my hyperparameters poorly chosen? Is there anything additional to change between SB2 and SB3? I am stuck changing parameters over and over again, and training takes way too long for me to keep my PC running 24/7.