Okay, when setting num_epochs to a higher value in sf, I get the warning "KL-divergence is very high". Any idea how I can work around this, or which hyperparameters I need to adjust to make training more stable?
Hi @visuallization
I think I might have missed this issue, my apologies.
So the KL-divergence is the difference between the action probability distributions generated by the policy BEFORE and AFTER the iteration (num_epochs) of training.
This divergence depends on many things, but mainly on the size and number of the gradient steps, which in turn depend on the learning rate, the batch size, the number of minibatches per epoch, and the number of epochs.
Other things, such as the magnitude of your rewards, can affect the size of the step as well, e.g. really large rewards will have an effect.
Learning rate, batch size and rewards magnitude being equal, it boils down to number of minibatches per epoch and number of epochs, which will determine the total number of SGD iterations.
You can quickly estimate your number of update steps by looking at version_diff_max in your wandb/tensorboard.
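In case it helps, here is a rough sketch of what that warning measures (not Sample Factory's actual code; the shapes and numbers below are made up for illustration): the KL between the action distributions the policy assigns to the same observations before and after the epochs of SGD.

```python
import torch
from torch.distributions import Categorical, kl_divergence

# Hypothetical example: 512 transitions from the rollout, 6 discrete actions.
logits_old = torch.randn(512, 6)                      # policy before the update epochs
logits_new = logits_old + 0.3 * torch.randn(512, 6)   # pretend the policy moved this much

kl = kl_divergence(Categorical(logits=logits_old),
                   Categorical(logits=logits_new)).mean()
print(f"mean KL(old || new) = {kl.item():.4f}")       # a large value triggers the warning
```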
Any chance you can share your sb3 and sf2 parameters?
@alex-petrenko Thanks again for your efforts and detailed answer! I really appreciate it.
I'll share my sf and sb3 parameters.
sf:
{
"help": false,
"algo": "APPO",
"env": "gdrl",
"experiment": "AirPlatformerPlus_00_num_batches_per_epoch_8_batch_size_64_rollout_32_num_epochs_10_num_workers_1",
"train_dir": "logs/sf",
"restart_behavior": "resume",
"device": "gpu",
"seed": null,
"num_policies": 1,
"async_rl": true,
"serial_mode": false,
"batched_sampling": false,
"num_batches_to_accumulate": 2,
"worker_num_splits": 2,
"policy_workers_per_policy": 1,
"max_policy_lag": 1000,
"num_workers": 1,
"num_envs_per_worker": 2,
"batch_size": 64,
"num_batches_per_epoch": 8,
"num_epochs": 10,
"rollout": 32,
"recurrence": 1,
"shuffle_minibatches": false,
"gamma": 0.99,
"reward_scale": 1.0,
"reward_clip": 1000.0,
"value_bootstrap": false,
"normalize_returns": true,
"exploration_loss_coeff": 0.005,
"value_loss_coeff": 0.5,
"kl_loss_coeff": 0.0,
"exploration_loss": "entropy",
"gae_lambda": 0.95,
"ppo_clip_ratio": 0.2,
"ppo_clip_value": 1.0,
"with_vtrace": false,
"vtrace_rho": 1.0,
"vtrace_c": 1.0,
"optimizer": "adam",
"adam_eps": 1e-05,
"adam_beta1": 0.9,
"adam_beta2": 0.999,
"max_grad_norm": 0.5,
"learning_rate": 0.0003,
"lr_schedule": "constant",
"lr_schedule_kl_threshold": 0.008,
"lr_adaptive_min": 1e-06,
"lr_adaptive_max": 0.01,
"obs_subtract_mean": 0.0,
"obs_scale": 1.0,
"normalize_input": true,
"normalize_input_keys": null,
"decorrelate_experience_max_seconds": 0,
"decorrelate_envs_on_one_worker": true,
"actor_worker_gpus": [],
"set_workers_cpu_affinity": true,
"force_envs_single_thread": false,
"default_niceness": 0,
"log_to_file": true,
"experiment_summaries_interval": 3,
"flush_summaries_interval": 30,
"stats_avg": 100,
"summaries_use_frameskip": true,
"heartbeat_interval": 20,
"heartbeat_reporting_interval": 180,
"train_for_env_steps": 1000000,
"train_for_seconds": 10000000000,
"save_every_sec": 120,
"keep_checkpoints": 2,
"load_checkpoint_kind": "latest",
"save_milestones_sec": -1,
"save_best_every_sec": 5,
"save_best_metric": "reward",
"save_best_after": 100000,
"benchmark": false,
"encoder_mlp_layers": [
512,
512
],
"encoder_conv_architecture": "convnet_simple",
"encoder_conv_mlp_layers": [
512
],
"use_rnn": false,
"rnn_size": 512,
"rnn_type": "gru",
"rnn_num_layers": 1,
"decoder_mlp_layers": [],
"nonlinearity": "relu",
"policy_initialization": "orthogonal",
"policy_init_gain": 1.0,
"actor_critic_share_weights": true,
"adaptive_stddev": true,
"continuous_tanh_scale": 0.0,
"initial_stddev": 1.0,
"use_env_info_cache": false,
"env_gpu_actions": false,
"env_gpu_observations": true,
"env_frameskip": 1,
"env_framestack": 4,
"pixel_format": "CHW",
"use_record_episode_statistics": true,
"with_wandb": false,
"wandb_user": null,
"wandb_project": "sample_factory",
"wandb_group": null,
"wandb_job_type": "SF",
"wandb_tags": [],
"with_pbt": false,
"pbt_mix_policies_in_one_env": true,
"pbt_period_env_steps": 5000000,
"pbt_start_mutation": 20000000,
"pbt_replace_fraction": 0.3,
"pbt_mutation_rate": 0.15,
"pbt_replace_reward_gap": 0.1,
"pbt_replace_reward_gap_absolute": 1e-06,
"pbt_optimize_gamma": false,
"pbt_target_objective": "true_objective",
"pbt_perturb_min": 1.1,
"pbt_perturb_max": 1.5,
"base_port": 21440,
"env_agents": 16,
"experiment_dir": "logs/sf",
"experiment_name": null,
"command_line": "--env=gdrl --train_for_env_steps=1000000 --num_workers=1 --learning_rate=0.0003 --exploration_loss_coeff=0.005 --lr_schedule=constant --num_epochs=10 --batch_size=64 --num_batches_per_epoch=8 --rollout=32",
"cli_args": {
"env": "gdrl",
"num_workers": 1,
"batch_size": 64,
"num_batches_per_epoch": 8,
"num_epochs": 10,
"rollout": 32,
"exploration_loss_coeff": 0.005,
"learning_rate": 0.0003,
"lr_schedule": "constant",
"train_for_env_steps": 1000000
},
"git_hash": "09fc75edd27cc1057c0e2fd042da1c11a4ed24a0",
"git_repo_name": "git@github.com:edbeeching/godot_rl_agents.git"
}
sb3:
policy: "MultiInputPolicy",
learning_rate: Union[float, Schedule] = 3e-4,
n_steps: int = 32,
batch_size: int = 64,
n_epochs: int = 10,
gamma: float = 0.99,
gae_lambda: float = 0.95,
clip_range: Union[float, Schedule] = 0.2,
clip_range_vf: Union[None, float, Schedule] = None,
ent_coef: float = 0.005,
vf_coef: float = 0.5,
max_grad_norm: float = 0.5,
use_sde: bool = False,
sde_sample_freq: int = -1,
target_kl: Optional[float] = None,
tensorboard_log: Optional[str] = None,
create_eval_env: bool = False,
policy_kwargs: Optional[Dict[str, Any]] = None,
verbose: int = 0,
seed: Optional[int] = None,
device: Union[th.device, str] = "auto",
_init_setup_model: bool = True,
sf will print the warning about KL-divergence and the agent will not train well, whereas in sb3 the agent has no problem learning, even with n_epochs=10. But maybe another parameter is making it hard for sf.
If I am not mistaken, in sf buffer_size = num_batches_per_epoch * batch_size,
and in sb3 buffer_size = n_steps * num_envs,
where num_envs = 16 in my environment.
So in my example the total buffer_size should be 512 in both libs.
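A quick sanity check of that arithmetic (just my reading of both configs, not library code):

```python
# sf:  dataset per training iteration = num_batches_per_epoch * batch_size
# sb3: rollout buffer size            = n_steps * num_envs
sf_buffer = 8 * 64     # num_batches_per_epoch=8, batch_size=64
sb3_buffer = 32 * 16   # n_steps=32, num_envs=16
assert sf_buffer == sb3_buffer == 512
```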
Just to clarify, are you using a vectorized environment with 16 agents? Because in sf2 you have num_workers=1 which means only one process simulating environments.
Some differences in configuration:
General comment about this cfg: your batch size seems to be low and the number of epochs high. Instead of 64 and 10, try 128 and 5, or 256 and 3. In my experience this generally works better, especially the longer you train. Also, if you can afford more than 16 agents, definitely try more, e.g. by increasing the number of workers. More agents is usually better.
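To make this concrete, some back-of-the-envelope arithmetic (assuming the dataset per iteration stays at 512 transitions, as in your config; this is just illustrative, not profiler output):

```python
buffer_size = 512  # num_batches_per_epoch * batch_size in the config above

for batch_size, num_epochs in [(64, 10), (128, 5), (256, 3)]:
    num_batches_per_epoch = buffer_size // batch_size
    sgd_steps = num_batches_per_epoch * num_epochs
    print(f"batch_size={batch_size}, num_epochs={num_epochs} "
          f"-> {sgd_steps} SGD steps per iteration")
# 80, 20 and 6 steps respectively: fewer steps on the same data means the
# policy drifts less per iteration, so the KL warning is less likely.
```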
@alex-petrenko Thanks again for your feedback!!
I am currently playing around with the different settings, trying to reproduce the sb3 performance.
There is one big thing I noticed though:
I created a few runs with a new config and changed only the number of workers, and this alone changed the accumulated reward dramatically, but not in the way I thought it would. With 8 workers I get a lower reward than with 1 worker. Obviously 8 workers train a lot faster than 1 worker, and this seems to impact the accumulated reward, in the sense that the faster it trains, the less reward the agents accumulate. Does this make sense? Do you know an explanation for this?
You can find the logs here (thanks to your great wandb integration :))
https://wandb.ai/visuallization/sample_factory/groups/gdrl/workspace?workspace=user-visuallization
AirPlatformerPlus_rollout_128_batch_size_128_num_batches_16_epochs_2_workers_1_20230721_153751_425804
which was trained with 1 worker and 1M train_for_env_steps: you get a reward of around 1400 at around 250 steps.
AirPlatformerPlus_rollout_128_batch_size_128_num_batches_16_epochs_2_workers_8_20230721_152033_014767
which was trained with the same config and 1M train_for_env_steps, but 8 workers: you get a reward of only around 300 at around 60 steps.
I am not sure what exactly those 250 and 60 steps correspond to, because it is obviously different from the 1M train_for_env_steps that would get visualized in tensorboard.
> I created a few runs with a new config and changed only the number of workers, and this alone changed the accumulated reward dramatically, but not in the way I thought it would. With 8 workers I get a lower reward than with 1 worker. Obviously 8 workers train a lot faster than 1 worker, and this seems to impact the accumulated reward, in the sense that the faster it trains, the less reward the agents accumulate. Does this make sense? Do you know an explanation for this?
By increasing the number of workers you increase the amount of experience that is collected in one iteration of training. For example, 10 agents with a rollout length of 64 give 640 steps, while 100 parallel agents with a rollout of 128 give 12800 experience steps.
This experience is then divided into minibatches according to your configuration. If you don't change the batch size and the rollout, increasing the number of agents by 10x will increase the number of learning steps you're doing on this experience by 10x as well. If your training is sensitive to policy lag, this can be detrimental.
Let's say you increased n_agents by 10x without changing anything else. Suppose agent 1's rollout is finished and added to the training data, and you train on it. Agent 1 starts collecting the next rollout. By the time you process the experience from all other agents, you would've done 10x more SGD steps, so when the next rollout from agent 1 is ready for training, some of this experience will be very stale (at least the beginning of the rollout was collected with a 10x older policy).
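A toy calculation to illustrate the scaling (hypothetical numbers, not Sample Factory internals):

```python
# With rollout length and minibatch size fixed, scaling the number of agents
# scales the experience collected per iteration and therefore the number of
# SGD updates done before a given agent's next rollout is trained on.
rollout, batch_size, num_epochs = 64, 64, 1

for n_agents in (10, 100):
    experience_per_iter = n_agents * rollout                       # transitions collected
    sgd_updates = (experience_per_iter // batch_size) * num_epochs
    print(f"{n_agents:>3} agents: {experience_per_iter:>5} transitions/iter, "
          f"~{sgd_updates} SGD updates between agent 1's consecutive rollouts")
```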
I have no clue what "step" actually refers to either. Perhaps the number of gradient steps? Both WandB and tensorboard weren't originally designed with RL in mind, so they report this rather arbitrary "step" value on the X-axis by default.
I would say you should never use "step"; it's just some arbitrary metric that is mostly an artifact of your training system. You don't care how many "steps" you really did, only what level of results you got after, say, an hour or a day of training.
Sometimes using "global_step" makes sense; it is the number of env steps consumed by training, which is useful for example if you want to understand sample efficiency.
But unless you plan to train in the real world on a real robot, my suggestion is to always use "Relative time (Wall)" on the X-axis. You can easily change this in your WandB workspace by clicking the x-axis icon in the top right corner.
If you plot it like that, it becomes clear what configuration is better.
It is --num_workers=8 --num_epochs=2 --rollout=128 --batch_size=1024 --num_batches_per_epoch=16
It is also clear that 1M frames is a tiny amount of experience for this task. Give it way more sample budget, like 100M or 1B.
It often makes sense to increase the batch size and/or reduce rollout when dealing with more parallel agents.
So my suggestion, try:
--num_workers=8 --num_epochs=2 --rollout=64 --batch_size=1024 --num_batches_per_epoch=8
--num_workers=8 --num_epochs=2 --rollout=32 --batch_size=1024 --num_batches_per_epoch=4
--num_workers=8 --num_epochs=2 --rollout=32 --batch_size=2048 --num_batches_per_epoch=2
All of these run 16 parallel agents in total, 8 workers times 2 envs per worker. To go further, try playing with num_envs_per_worker (see the quick env count after the examples below):
32 envs: --num_workers=8 --num_envs_per_worker=4 --num_epochs=2 --rollout=32 --batch_size=2048 --num_batches_per_epoch=4
64 envs: --num_workers=8 --num_envs_per_worker=8 --num_epochs=2 --rollout=32 --batch_size=4096 --num_batches_per_epoch=4
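And just to double-check the env counts these combinations give (simple arithmetic, nothing library-specific):

```python
# number of parallel env instances = num_workers * num_envs_per_worker
for num_workers, num_envs_per_worker in [(8, 2), (8, 4), (8, 8)]:
    print(f"{num_workers} workers x {num_envs_per_worker} envs/worker = "
          f"{num_workers * num_envs_per_worker} parallel envs")
# 16, 32 and 64 parallel envs respectively
```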
SF2 is a bit more advanced when it comes to configuration - it gives you way more freedom but it is also easier to mess it up, and there is a learning curve.
I wish I had time to make it streamlined and easy to use + keep the ability to use sophisticated advanced configurations. Maybe if I have more time to work on it in the future! :)
Questions like yours in this thread really help me understand what kinds of issues users are dealing with, and it's hard to predict in advance when you're developing by yourself and for your own needs. So thank you for posting here!
Also, RL really loves scale. I can't overstate how important scale is. Don't be afraid to max out your hardware. If you have memory for it, run more envs and bigger batches, for longer.
Some of the most impressive RL results that I got were achieved by running on 20K+ parallel sims for like a trillion steps: https://sites.google.com/view/dexpbt
@alex-petrenko Thank you so much for all your help, suggestions and background knowledge. It really helps me understand what is going on underneath the hood.
Your first suggestions worked really well as you can see here:
--num_workers=8 --num_epochs=2 --rollout=64 --batch_size=1024 --num_batches_per_epoch=8
--num_workers=8 --num_epochs=2 --rollout=32 --batch_size=1024 --num_batches_per_epoch=4
--num_workers=8 --num_epochs=2 --rollout=32 --batch_size=2048 --num_batches_per_epoch=2
I also tried training the last configuration with 10M steps instead of 1M and it is still learning and achieves higher rewards.
It is really impressive how fast Sample Factory is, and even more impressive that you basically develop this single-handedly. It really makes RL much nicer to work with, as you can iterate much faster. So big kudos!
I am still struggling to achieve the same or higher rewards with your last configuration:
#32 envs
--num_workers=8 --num_envs_per_worker=4 --num_epochs=2 --rollout=32 --batch_size=2048 --num_batches_per_epoch=4
#64 envs
--num_workers=8 --num_envs_per_worker=8 --num_epochs=2 --rollout=32 --batch_size=4096 --num_batches_per_epoch=4
But I guess it is a matter of again adjusting rollout, batch_size and num_batches_per_epoch.
btw: really impressive results in your paper!
Glad it helped!
Hey there, I have a question regarding num_epochs and its effect on training performance. I am currently comparing sf with sb3, and sf is super fast. It is really amazing. I am just confused about num_epochs and its effect on training performance: when I increase num_epochs, the reward in sb3 goes up, but in sf the reward goes down. Do you know what might be happening in this case? I try to keep the hyperparameters as similar as possible to make the libs comparable.