DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Help with understanding PPO hyperparameters (SB2 vs SB3) #1746

Closed A-Artemis closed 10 months ago

A-Artemis commented 1 year ago

❓ Question

Hi, I am struggling to get PPO to learn effectively on my environment. The reward earned is not smooth and it spikes. This is the reward after 7 million steps: [image: reward curve]

I am using a custom env with these settings:

action_space = spaces.Box(low=0, high=1, shape=(17,))
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(94,))
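
For reference, the env skeleton looks roughly like this (a sketch using the Gymnasium API; the reset/step bodies are placeholders, not the actual dynamics or reward):

import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CustomEnv(gym.Env):
    """Skeleton env with the same action/observation spaces as above."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(low=0, high=1, shape=(17,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(94,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(94, dtype=np.float32), {}

    def step(self, action):
        obs = np.zeros(94, dtype=np.float32)
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}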

The PPO algorithm is set up with the following parameters:

import torch.nn as nn

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.logger import configure
from stable_baselines3.common.vec_env import SubprocVecEnv

policy_kwargs = {
    "log_std_init": -2,
    "ortho_init": False,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [128, 128],
        "vf": [128, 128],
    },
}
model = PPO(
    policy="MlpPolicy",
    env=envs, # make_vec_env(env_id=make_callable_env(), n_envs=32, vec_env_cls=SubprocVecEnv)
    learning_rate=0.0005,
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=True,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
)
log = configure(folder="./models", format_strings=["stdout", "csv", "tensorboard"])
model.set_logger(log)
model.learn(total_timesteps=50_000_000, progress_bar=True, log_interval=1)

I have tried to use the Optuna framework (https://optuna.org/) to do some hyperparameter optimization, changing the network architecture size between 64/128/256 as well as trying different values of n_steps, batch_size, activation_fn, etc., but I have not found a suitable set. Hyperparameter optimization is also incredibly time-consuming, since I expect the agent to learn well (with a reward above 50% of the episode length) within 1,000,000 steps. Reaching 1,000,000 steps takes hours, and adequate learning takes ~10,000,000 steps, so with my current hardware such a parameter sweep is not feasible.
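
For reference, a trial objective along these lines could look roughly like the sketch below (make_callable_env() is the env factory from the snippet above; the search ranges and the reduced per-trial budget are illustrative, not my exact script):

import optuna

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial: optuna.Trial) -> float:
    # Sample a few of the hyperparameters mentioned above
    hidden_size = trial.suggest_categorical("hidden_size", [64, 128, 256])
    n_steps = trial.suggest_categorical("n_steps", [512, 1024, 1536, 2048])
    batch_size = trial.suggest_categorical("batch_size", [256, 512, 1024])
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    env = make_callable_env()()  # the factory returns an env constructor
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=learning_rate,
        n_steps=n_steps,
        batch_size=batch_size,
        policy_kwargs={"net_arch": {"pi": [hidden_size] * 2, "vf": [hidden_size] * 2}},
        verbose=0,
    )
    # Much shorter budget per trial than a full run, otherwise the sweep never finishes
    model.learn(total_timesteps=200_000)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)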

I have used SB2 with the same env and it learned smoothly: [image: SB2 reward curve]

I have had a look at the SB2 to SB3 migration guide and copied over the old parameters as best I could, but without success. I also checked out the rl_zoo for inspiration.

I have also checked TensorBoard and nothing seems out of the ordinary: [image: TensorBoard training curves]

Is there something that I am missing? Are my hyperparameters poorly chosen? Is there anything else that differs between SB2 and SB3? I am stuck changing parameters over and over again, and training takes way too long for me to keep my PC running 24/7.


araffin commented 1 year ago

Hello, could you please provide the hyperparameters you used for SB2 PPO?

Related issues (please have a look): https://github.com/DLR-RM/stable-baselines3/issues/90#issuecomment-742525593 and https://github.com/DLR-RM/stable-baselines3/issues/512#issuecomment-881281399

A-Artemis commented 1 year ago

Here are the hyperparameters used for SB2 PPO

def policy_fn(name, ob_space, ac_space):
    # MlpPolicy from the old Baselines PPO1 (pposgd_simple) implementation
    return MlpPolicy(
        name=name,
        ob_space=ob_space,  # same as SB3
        ac_space=ac_space,  # same as SB3
        hid_size=312,
        num_hid_layers=2,
        num_of_categories=3,
    )

pposgd_simple.learn(
    env_creator=env,  # same env as above
    workerseed=seed + 10000 * MPI.COMM_WORLD.Get_rank(),  # this was either 4 or 8 threads
    policy_fn=policy_fn,
    max_timesteps=50000000,
    timesteps_per_actorbatch=1536,
    clip_param=0.2,
    entcoeff=0.01,
    optim_epochs=4,
    optim_stepsize=0.001,
    optim_batchsize=512,
    gamma=0.99,
    lam=0.95,
    schedule="linear",
    stochastic=True,
)

araffin commented 1 year ago

I see, you are using PPO1 (PPO with MPI). I'm not sure how you translated the hyperparameters to SB3 PPO; some seem quite off (for instance, optim_stepsize=0.001 in SB2 PPO, but you use learning_rate=0.0005).

I'm not sure where you got the

  hid_size=312,
  num_hid_layers=2,
  num_of_categories=3,

from, as it is not a parameter of PPO1 MlpPolicy. Same for stochastic...

Your parameters should translate to:

from typing import Callable

import torch.nn as nn

from stable_baselines3 import PPO

hidden_size = 312
policy_kwargs = {
    "log_std_init": 0.0,
    "ortho_init": True,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [hidden_size, hidden_size],
        "vf": [hidden_size, hidden_size],
    },
# Note: Adam epsilon is 1e-5 by default for SB3 PPO
}
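
# If you want to set Adam's epsilon explicitly, the policy also accepts
# optimizer_kwargs, e.g.:
# policy_kwargs["optimizer_kwargs"] = {"eps": 1e-5}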

# IMPORTANT: n_envs influences the number of steps collected
n_envs = 8
# make_vec_env(env_id=make_callable_env(), n_envs=n_envs, vec_env_cls=SubprocVecEnv)

# PPO1 has schedule='linear' as a default
def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """
    Linear learning rate schedule.

    :param initial_value: Initial learning rate.
    :return: schedule that computes
      current learning rate depending on remaining progress
    """
    def func(progress_remaining: float) -> float:
        """
        Progress will decrease from 1 (beginning) to 0.

        :param progress_remaining:
        :return: current learning rate
        """
        return progress_remaining * initial_value

    return func
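
# For example, with linear_schedule(0.001):
#   progress_remaining=1.0 (start of training) -> learning rate 0.001
#   progress_remaining=0.0 (end of training)   -> learning rate 0.0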

model = PPO(
    policy="MlpPolicy",
    env=envs, 
    learning_rate=linear_schedule(0.001),
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=1,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
    max_grad_norm=100,  # PPO1 doesn't rescale the gradient apparently
)

Please note that the number of envs in parallel is an important hyperparameter (see notebook in our doc).
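
For completeness, the envs used above could be created along these lines (a sketch; make_callable_env() is the hypothetical env factory from your original snippet, and n_envs=8 is an assumption mirroring the 4 or 8 MPI workers mentioned for PPO1):

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# n_envs = 8 as set above; make_callable_env() returns the env constructor
envs = make_vec_env(make_callable_env(), n_envs=n_envs, vec_env_cls=SubprocVecEnv)

# then train as before
model.learn(total_timesteps=50_000_000, progress_bar=True, log_interval=1)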

A-Artemis commented 1 year ago

Thank you for working out the hyperparameters! I will try these out over the weekend, as it takes a day to train.