DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Bug]: ppo_lstm not implemented in hyperparams_opt.py #409

Closed. technocrat13 closed this issue 11 months ago.

technocrat13 commented 11 months ago

🐛 Bug

There is no direct way to optimize hyperparameters for ppo_lstm (RecurrentPPO) from the command line. It can only be achieved by amending HYPERPARAMS_SAMPLER in hyperparams_opt.py like so:

HYPERPARAMS_SAMPLER = {
    ...
    "ppo": sample_ppo_params,
    "ppo_lstm": sample_ppo_params,
    ...
}

Interestingly, ALGOS in utils.py does include ppo_lstm, yet passing 'ppo_lstm' with -optimize/--optimize-hyperparameters still results in a KeyError:

ALGOS: Dict[str, Type[BaseAlgorithm]] = {
    ...
    "ppo_lstm": RecurrentPPO,
}

Am I missing something? As I understand it, the hyperparameters themselves do not change much going from PPO to RecurrentPPO, only their values, so the same sampling function can be used. If the policy_kwargs needs to change as well, then a sample_ppo_lstm_params() needs to be implemented. I'd be more than happy to send a pull request!

To Reproduce

python -m rl_zoo3.train --algo ppo_lstm --env LunarLanderNoVel-v2 -n 500 -optimize --n-trials 10 --n-jobs 3 --sampler tpe --pruner median

Relevant log output / Error message

Trial 0 failed with parameters: {} because of the following error: KeyError('ppo_lstm').
Traceback (most recent call last):
  File "C:\Users\~\miniconda3\envs\tf\lib\site-packages\optuna\study\_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "C:\Users\~\miniconda3\envs\tf\lib\site-packages\rl_zoo3\exp_manager.py", line 737, in objective
    sampled_hyperparams = HYPERPARAMS_SAMPLER[self.algo](trial)
KeyError: 'ppo_lstm'

System Info

pip installed zoo, sb3, and sb3-contrib; ensured the relevant files are up to date with the latest rl-baselines3-zoo/main


araffin commented 11 months ago

I'd be more than happy to send a pull request!

Please do =) It is indeed missing.

I think you just need to extend the ppo sampler (have a sample_ppo_lstm() that will call sample_ppo_params()).

"ppo_lstm": sample_ppo_params, is actually already a quick and good solution.

technocrat13 commented 11 months ago

In my implementation, where sample_ppo_lstm_params() calls sample_ppo_params(), I am running into a limitation of optuna's trial.suggest_categorical()

I am sampling net_arch from ["tiny", "medium", "large"] for the LSTM, but vanilla PPO samples from only ["medium", "large"]

Optuna cannot suggest the categorical variable because it does not support multiple parameters with the same name but different value spaces, and it does not override the existing space either. There are some discussions about implementing this, but they are old.

Using a new name for the LSTM's net_arch (e.g. net_arch_lstm) wastes search space and is an inelegant solution.
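
A minimal standalone example of the limitation (the error behaviour is as I understand Optuna; the exact exception message may differ):

import optuna


def objective(trial: optuna.Trial) -> float:
    # First suggestion, e.g. made inside sample_ppo_params()
    trial.suggest_categorical("net_arch", ["medium", "large"])
    # Same name, different value space, e.g. from an LSTM-specific sampler:
    # Optuna raises a ValueError here instead of overriding the first space.
    trial.suggest_categorical("net_arch", ["tiny", "medium", "large"])
    return 0.0


study = optuna.create_study()
study.optimize(objective, n_trials=1)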

I have three possible solutions to this:

  1. Not including "tiny" in sample_ppo_lstm_params():

    • does not impact the implementation of PPO
    • however, some environments benefit from a smaller neural network inside the LSTM and do not require a bigger one
  2. Including "tiny" in sample_ppo_params():

    • simplest, most straightforward implementation
    • increases the search space for PPO, although "tiny" will not be explored much if its performance is not adequate
  3. Passing a flag to sample_ppo_params() (see the sketch after this list):

    • retains all existing functionality and does not impact PPO's search space
    • possibly confusing to read and understand at a glance

All three of these assume that I am extending the existing function and updating "policy_kwargs" in the returned dictionary; if I just write a new function, these issues simply do not exist.
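
For illustration, a sketch of option 3 (the lstm flag and the trimmed-down body are my assumptions, not the current hyperparams_opt.py implementation; only the net_arch part is shown):

from typing import Any, Dict

import optuna


def sample_ppo_params(trial: optuna.Trial, lstm: bool = False) -> Dict[str, Any]:
    # Only the LSTM variant gets the extra "tiny" choice, so the vanilla
    # PPO search space is unchanged.
    choices = ["tiny", "medium", "large"] if lstm else ["medium", "large"]
    net_arch = trial.suggest_categorical("net_arch", choices)
    # ... sample the remaining PPO hyperparameters as before; the real
    # sampler would fold net_arch into policy_kwargs ...
    return {"net_arch": net_arch}


def sample_ppo_lstm_params(trial: optuna.Trial) -> Dict[str, Any]:
    hyperparams = sample_ppo_params(trial, lstm=True)
    # Update policy_kwargs for RecurrentPPO (e.g. LSTM size) here if needed.
    return hyperparams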

Which solution do you think is ideal, @araffin? And do you have any suggestions of your own? I can look into implementing them.

araffin commented 11 months ago

Either solution 1 or 2 is fine for me.