araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

Reproducible zoo result #120

Closed · blurLake closed this issue 2 years ago

blurLake commented 2 years ago

Hi,

I am trying to get reproducible and deterministic results from the zoo's hyperparameter optimization. Right now I only have one set of hyperparameters as a candidate (attached below).

def sample_sac_params(trial):
    """
    Sampler for SAC hyperparams.

    :param trial: (optuna.trial)
    :return: (dict)
    """
    gamma = trial.suggest_categorical('gamma', [0.01])
    learning_rate = trial.suggest_categorical('lr', [.5e-3])
    learning_starts = trial.suggest_categorical('learning_starts', [1])
    batch_size = trial.suggest_categorical('batch_size', [256])
    buffer_size = trial.suggest_categorical('buffer_size', [int(1e4)])
    train_freq = trial.suggest_categorical('train_freq', [1])
    tau = trial.suggest_categorical('tau', [0.001])
    # gradient_steps takes too much time
    # gradient_steps = trial.suggest_categorical('gradient_steps', [1, 100, 300])
    gradient_steps = train_freq
    ent_coef = trial.suggest_categorical('ent_coef', [0.05])
    net_arch = trial.suggest_categorical('net_arch', ["wide"])
    action_noise = trial.suggest_categorical('action_noise', [None])
    random_exploration = trial.suggest_categorical('random_exploration', [0.0])

    net_arch = {
        'small': [64, 64],
        'premedium': [128, 128],
        'medium': [256, 256],
        'big': [400, 300],
        'deep': [400, 400, 300],
        'wide': [700, 600],
        'widedeep': [700, 700, 600],
    }[net_arch]

    target_entropy = 'auto'
    if ent_coef == 'auto':
        target_entropy = trial.suggest_categorical('target_entropy', ['auto', -1, -10]) #, -20, -50, -100])

    return {
        'gamma': gamma,
        'learning_rate': learning_rate,
        'batch_size': batch_size,
        'buffer_size': buffer_size,
        'tau': tau,
        'learning_starts': learning_starts,
        'train_freq': train_freq,
        'gradient_steps': gradient_steps,
        'ent_coef': ent_coef,
        'target_entropy': target_entropy,
        'policy_kwargs': dict(layers=net_arch),
        'action_noise': action_noise,
        'random_exploration': random_exploration
    }
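
For context, the zoo feeds this sampler into an Optuna study; a minimal sketch of that pattern (not the zoo's actual objective function, and with a placeholder environment and evaluation loop) looks like this:

import gym
import optuna
from stable_baselines import SAC

def objective(trial):
    # Sample one hyperparameter combination (here every parameter has a single candidate).
    kwargs = sample_sac_params(trial)
    env = gym.make('Pendulum-v0')  # placeholder for the actual environment
    model = SAC('MlpPolicy', env, **kwargs)
    model.learn(total_timesteps=10000)
    # Placeholder evaluation: return of one deterministic episode.
    obs, done, episode_return = env.reset(), False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        episode_return += reward
    return episode_return

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=2)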

I set the seed at the beginning of train.py as follows:

# set the seed
# disable GPU
import os
import numpy as np
import tensorflow as tf

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
seed_value = 0
# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
os.environ['PYTHONHASHSEED'] = str(seed_value)

# 2. Set `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)

# 3. Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)

# 4. Set the `tensorflow` pseudo-random generator at a fixed value
tf.random.set_random_seed(seed_value)

# 5. Configure a new global `tensorflow` session
from tensorflow.keras import backend as K
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--env', type=str, default="CartPole-v1", help='environment ID')
    parser.add_argument('-tb', '--tensorboard-log', help='Tensorboard log dir', default='', type=str)
    parser.add_argument('-i', '--trained-agent', help='Path to a pretrained agent to continue training',
                        default='', type=str)
    parser.add_argument('--algo', help='RL Algorithm', default='ppo2',
                        type=str, required=False, choices=list(ALGOS.keys()))
    parser.add_argument('-n', '--n-timesteps', help='Overwrite the number of timesteps', default=-1,
                        type=int)
    parser.add_argument('--log-interval', help='Override log interval (default: -1, no change)', default=-1,
                        type=int)
    parser.add_argument('--eval-freq', help='Evaluate the agent every n steps (if negative, no evaluation)',
                        default=10000, type=int)
    parser.add_argument('--eval-episodes', help='Number of episodes to use for evaluation',
                        default=5, type=int)
    parser.add_argument('--save-freq', help='Save the model every n steps (if negative, no checkpoint)',
                        default=-1, type=int)
    parser.add_argument('-f', '--log-folder', help='Log folder', type=str, default='logs')
    parser.add_argument('--seed', help='Random generator seed', type=int, default=0)
    parser.add_argument('--n-trials', help='Number of trials for optimizing hyperparameters', type=int, default=10)
    parser.add_argument('-optimize', '--optimize-hyperparameters', action='store_true', default=False,
                        help='Run hyperparameters search')
    parser.add_argument('--n-jobs', help='Number of parallel jobs when optimizing hyperparameters', type=int, default=1)
    parser.add_argument('--sampler', help='Sampler to use when optimizing hyperparameters', type=str,
                        default='tpe', choices=['random', 'tpe', 'skopt'])
    parser.add_argument('--pruner', help='Pruner to use when optimizing hyperparameters', type=str,
                        default='median', choices=['halving', 'median', 'none'])
    parser.add_argument('--verbose', help='Verbose mode (0: no output, 1: INFO)', default=1,
                        type=int)
    parser.add_argument('--gym-packages', type=str, nargs='+', default=[],
                        help='Additional external Gym environment package modules to import (e.g. gym_minigrid)')
    parser.add_argument('-params', '--hyperparams', type=str, nargs='+', action=StoreDict,
                        help='Overwrite hyperparameter (e.g. learning_rate:0.01 train_freq:10)')
    parser.add_argument('-uuid', '--uuid', action='store_true', default=False,
                        help='Ensure that the run has a unique ID')
    parser.add_argument('--env-kwargs', type=str, nargs='+', action=StoreDict,
                        help='Optional keyword argument to pass to the env constructor')
    args = parser.parse_args()

    set_global_seeds(args.seed)

I also modified set_global_seeds() in misc_util.py as follows (adding the Python hash seed and disabling the GPU, just in case):

def set_global_seeds(seed):
    """
    set the seed for python random, tensorflow, numpy and gym spaces

    :param seed: (int) the seed
    """
    tf.set_random_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    # prng was removed in latest gym version
    if hasattr(gym.spaces, 'prng'):
        gym.spaces.prng.seed(seed)

    # set the seed
    # disable GPU
    import os
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    # 1. Set `PYTHONHASHSEED` environment variable at a fixed value
    os.environ['PYTHONHASHSEED']=str(seed)

    # 2. Set the `tensorflow` pseudo-random generator at a fixed value
    tf.random.set_random_seed(seed)

There is some duplication, but that should be fine as long as the seed is 0. Then I run train.py with seed = 0.
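
The invocation is along these lines (the environment ID is just a placeholder for my custom env; all flags are from the argparse shown above):

python train.py --algo sac --env MyCustomEnv-v0 --seed 0 -optimize --n-trials 2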

I expected that, since the hyperparameters are the same, different trials would give the same reward, but that is not the case. See the attached picture: even though the hyperparameters are identical, the results from trial 0 and trial 1 differ.

[Screenshot 2022-02-07 at 12 10 42: Optuna output showing different rewards for trial 0 and trial 1 despite identical hyperparameters]

blurLake commented 2 years ago

A follow-up to show that the environment is deterministic and reproducible. I set the seed in a Python script called SAC_Tuning_test_zoo.py as follows:

# disable GPU
import os
import numpy as np
import tensorflow as tf

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
seed_value = 0
# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
os.environ['PYTHONHASHSEED'] = str(seed_value)

# 2. Set `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)

# 3. Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)

# 4. Set the `tensorflow` pseudo-random generator at a fixed value
tf.random.set_random_seed(seed_value)

# 5. Configure a new global `tensorflow` session
from tensorflow.keras import backend as K
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

and then build the model with seed=0

model = SAC(CustomSACPolicy, env, gamma=0.1, tau=0.005, learning_rate=0.000533201801295971,
            buffer_size=10000, action_noise=None, verbose=1, batch_size=512,
            tensorboard_log="./zoo_repro_tensorboard/", ent_coef=0.05, train_freq=2,
            random_exploration=0.0, seed=0, learning_starts=1)

Here I ran two separate trainings for 20 timesteps with seed=0 and printed out the reward from each step. The two runs produce exactly the same rewards, as can be seen from the attached picture.

[Screenshot 2022-02-07 at 11 35 45: identical per-step rewards from the two runs]
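
For reference, a minimal sketch of this kind of determinism check (the environment and policy are placeholders, and the reward logging uses a small gym wrapper rather than my exact printing code):

import gym
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

def run_once(seed):
    env = gym.make('Pendulum-v0')  # placeholder for the actual environment
    env.seed(seed)
    rewards = []

    # Thin wrapper that records every reward seen during training.
    class RewardLogger(gym.Wrapper):
        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            rewards.append(reward)
            return obs, reward, done, info

    model = SAC(MlpPolicy, RewardLogger(env), seed=seed, learning_starts=1, verbose=0)
    model.learn(total_timesteps=20)
    return rewards

# Two runs with the same seed should produce identical reward sequences.
print(run_once(0) == run_once(0))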

blurLake commented 2 years ago

Not sure if this is a Stable Baselines or Optuna issue, but it would be great to get some suggestions from you. Thank you very much!

blurLake commented 2 years ago

The solution is to add the seed explicitly to the SAC hyperparameter candidate list, since the default seed is None. Now the different trials give the same value (since there is only one hyperparameter combination).
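
Concretely, the change in sample_sac_params looks roughly like this (only the seed-related lines are new; everything else stays as in the first comment):

def sample_sac_params(trial):
    # ... same sampling code as above ...
    # Explicitly fix the model seed: SAC's default is seed=None, so without this
    # every trial builds an unseeded model and identical hyperparameters can still
    # give different rewards.
    seed = trial.suggest_categorical('seed', [0])

    return {
        # ... same entries as above ...
        'seed': seed,
    }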