araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

[question] Architecture Search #57

jarlva closed this issue 4 years ago

jarlva commented 4 years ago

I understand that there is a way to tune hyperparameters. Is there a way to tune the actual model architecture (number of layers and units)? If not, is it possible to integrate something like AdaNet?

araffin commented 4 years ago

hello,

You mean optimizing the model architecture? Yes, it is possible: you need to change the sampler script a bit and pass policy_kwargs=dict(net_arch=[64, 64]) (or layers= for SAC/DQN, ...) to the constructor (cf. the documentation).
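For reference, a minimal sketch of what this looks like with the Stable Baselines constructors (the environment names here are just illustrations):

from stable_baselines import PPO2, SAC

# Policies from common.policies (PPO2, A2C, ...) take net_arch:
model = PPO2('MlpPolicy', 'CartPole-v1', policy_kwargs=dict(net_arch=[64, 64]))

# SAC/DQN/DDPG policies take layers instead:
model_sac = SAC('MlpPolicy', 'Pendulum-v0', policy_kwargs=dict(layers=[64, 64]))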

jarlva commented 4 years ago

Thanks Antonin,

Yes, optimizing the model architecture (units, layers, etc.). I'm new to SB and have tried some things (https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html), yet it's not clear how exactly to tune the model (via Optuna, I assume). Would it be possible to get a simple example (like CartPole)?

Much appreciated! Jake

eunomiadev commented 4 years ago

is this what you want?

import gym
import numpy as np
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLnLstmPolicy
import optuna

n_cpu = 4

def optimize_ppo2(trial):
    """ Learning hyperparamters we want to optimise"""
    return {
        'n_steps': int(trial.suggest_loguniform('n_steps', 16, 2048)),
        'gamma': trial.suggest_loguniform('gamma', 0.9, 0.9999),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1.),
        'ent_coef': trial.suggest_loguniform('ent_coef', 1e-8, 1e-1),
        'cliprange': trial.suggest_uniform('cliprange', 0.1, 0.4),
        'noptepochs': int(trial.suggest_loguniform('noptepochs', 1, 48)),
        'lam': trial.suggest_uniform('lam', 0.8, 1.)
    }

def optimize_agent(trial):
    """ Train the model and evaluate it.
        Optuna minimizes the objective by default, so we
        negate the mean reward here.
    """
    model_params = optimize_ppo2(trial)
    env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, nminibatches=1, **model_params)
    model.learn(10000)

    # Evaluate on the vectorized env: rewards and dones come back as arrays of
    # length n_cpu, and SubprocVecEnv resets each sub-env automatically.
    episode_rewards = []
    reward_sums = np.zeros(n_cpu)
    state = None
    dones = np.array([False] * n_cpu)

    obs = env.reset()
    while len(episode_rewards) < 4:
        # Recurrent policy: carry the LSTM state and pass the done mask
        actions, state = model.predict(obs, state=state, mask=dones)
        obs, rewards, dones, _ = env.step(actions)
        reward_sums += rewards

        for i in np.where(dones)[0]:
            episode_rewards.append(reward_sums[i])
            reward_sums[i] = 0.0

    mean_reward = np.mean(episode_rewards)
    trial.report(-1 * mean_reward)

    return -1 * mean_reward

if __name__ == '__main__':
    # Optuna minimizes by default, which matches the negated reward returned above
    study = optuna.create_study(study_name='cartpole_optuna', storage='sqlite:///params.db', load_if_exists=True)
    study.optimize(optimize_agent, n_trials=1000, n_jobs=1)
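
A small usage note on the script above: since the study is persisted in params.db, it can be reloaded later and inspected (best_params and best_value are standard Optuna study attributes):

import optuna

# Reload the persisted study and print the best trial found so far
study = optuna.create_study(study_name='cartpole_optuna', storage='sqlite:///params.db', load_if_exists=True)
print(study.best_params)
print(study.best_value)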

jarlva commented 4 years ago

Thanks for the script Eunomia! That has been very helpful!

Is there a place to define and tune the TensorFlow model layers/tensors? For example, in Keras the model is defined by:

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))

Doing the same in TensorFlow is a bit less simple. Optimizing the model (units/layers and activation) for a specific problem can yield remarkable results and speed-ups. To that end, Google came up with AdaNet AutoML, a way to automatically find/tune the best TensorFlow model (not sure how to apply it in RL). Is there a way to tune the model's units/layers/activation (maybe by modifying the script above) via Optuna (or maybe AdaNet)?

araffin commented 4 years ago

@jheffez

The code you are looking for (and that @eunomiadev wrote) is here.

Is there a place to define and tune the tensorflow model layers/tensors?

Please read the documentation for that (especially the "custom policy" part). A quick example:

model = PPO2('MlpPolicy', 'CartPole-v1', policy_kwargs=dict(net_arch=[256, 256]))

with optuna:

def optimize_ppo2(trial):
    """ Learning hyperparamters we want to optimise"""
    net_arch = trial.suggest_categorical('net_arch', ['small', 'medium'])
    net_arch = {
        'small': [dict(pi=[64, 64], vf=[64, 64])],
        'medium': [dict(pi=[256, 256], vf=[256, 256])],
    }[net_arch]
    return {
        'policy_kwargs': dict(net_arch=net_arch),
    }
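
If you also want to search over the activation function (as asked above), a minimal extension of this sampler could look like the sketch below; act_fun is the activation keyword of Stable Baselines' feed-forward policies, and the candidate set chosen here is just an illustration:

import tensorflow as tf

def optimize_ppo2(trial):
    """ Sample the network architecture and activation we want to optimise """
    net_arch = trial.suggest_categorical('net_arch', ['small', 'medium'])
    net_arch = {
        'small': [dict(pi=[64, 64], vf=[64, 64])],
        'medium': [dict(pi=[256, 256], vf=[256, 256])],
    }[net_arch]
    act_fun = trial.suggest_categorical('activation', ['tanh', 'relu', 'elu'])
    act_fun = {'tanh': tf.nn.tanh, 'relu': tf.nn.relu, 'elu': tf.nn.elu}[act_fun]
    return {
        'policy_kwargs': dict(net_arch=net_arch, act_fun=act_fun),
    }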

I also recommend reading the Optuna documentation; you should find an answer to your questions ;)

jarlva commented 4 years ago

Thanks again! I'll check it out.