alex-petrenko / sample-factory

High throughput synchronous and asynchronous reinforcement learning
https://samplefactory.dev
MIT License

Model "remembers" instead of learning #260

Open jarlva opened 1 year ago

jarlva commented 1 year ago

Hey, after training (~200M steps) with good reward, Enjoy shows poor reward on unseen data. When the training data is included in Enjoy, the reward matches training. So it seems the model "remembers" the data rather than learning from it.

What's the best way to deal with that (other than adding more data and introducing random noise)? Are there settings to try?

I'm training a gym-like env with the following config:

{
  "help": false,
  "algo": "APPO",
  "env": "Myrl-v0",
  "experiment": "0114-1156.2-62",
  "train_dir": "./train_dir",
  "restart_behavior": "resume",
  "device": "gpu",
  "seed": 5,
  "num_policies": 1,
  "async_rl": true,
  "serial_mode": false,
  "batched_sampling": false,
  "num_batches_to_accumulate": 2,
  "worker_num_splits": 2,
  "policy_workers_per_policy": 1,
  "max_policy_lag": 1000,
  "num_workers": 32,
  "num_envs_per_worker": 28,
  "batch_size": 1024,
  "num_batches_per_epoch": 1,
  "num_epochs": 1,
  "rollout": 32,
  "recurrence": 1,
  "shuffle_minibatches": false,
  "gamma": 0.99,
  "reward_scale": 1.0,
  "reward_clip": 1000.0,
  "value_bootstrap": false,
  "normalize_returns": true,
  "exploration_loss_coeff": 0.003,
  "value_loss_coeff": 0.5,
  "kl_loss_coeff": 0.0,
  "exploration_loss": "entropy",
  "gae_lambda": 0.95,
  "ppo_clip_ratio": 0.1,
  "ppo_clip_value": 1.0,
  "with_vtrace": false,
  "vtrace_rho": 1.0,
  "vtrace_c": 1.0,
  "optimizer": "adam",
  "adam_eps": 1e-06,
  "adam_beta1": 0.9,
  "adam_beta2": 0.999,
  "max_grad_norm": 4.0,
  "learning_rate": 0.0001,
  "lr_schedule": "constant",
  "lr_schedule_kl_threshold": 0.008,
  "obs_subtract_mean": 0.0,
  "obs_scale": 1.0,
  "normalize_input": true,
  "normalize_input_keys": null,
  "decorrelate_experience_max_seconds": 0,
  "decorrelate_envs_on_one_worker": true,
  "actor_worker_gpus": [],
  "set_workers_cpu_affinity": true,
  "force_envs_single_thread": false,
  "default_niceness": 0,
  "log_to_file": true,
  "experiment_summaries_interval": 10,
  "flush_summaries_interval": 30,
  "stats_avg": 100,
  "summaries_use_frameskip": true,
  "heartbeat_interval": 20,
  "heartbeat_reporting_interval": 180,
  "train_for_env_steps": 985000000,
  "train_for_seconds": 10000000000,
  "save_every_sec": 60,
  "keep_checkpoints": 1,
  "load_checkpoint_kind": "best",
  "save_milestones_sec": -1,
  "save_best_every_sec": 15,
  "save_best_metric": "7.ARGPB",
  "save_best_after": 20000000,
  "benchmark": false,
  "encoder_mlp_layers": [
    512,
    512
  ],
  "encoder_conv_architecture": "convnet_simple",
  "encoder_conv_mlp_layers": [
    512
  ],
  "use_rnn": false,
  "rnn_size": 512,
  "rnn_type": "gru",
  "rnn_num_layers": 1,
  "decoder_mlp_layers": [],
  "nonlinearity": "elu",
  "policy_initialization": "orthogonal",
  "policy_init_gain": 1.0,
  "actor_critic_share_weights": true,
  "adaptive_stddev": true,
  "continuous_tanh_scale": 0.0,
  "initial_stddev": 1.0,
  "use_env_info_cache": false,
  "env_gpu_actions": false,
  "env_gpu_observations": true,
  "env_frameskip": 1,
  "env_framestack": 1,
  "pixel_format": "CHW",
  "use_record_episode_statistics": false,
  "with_wandb": false,
  "wandb_user": null,
  "wandb_project": "sample_factory",
  "wandb_group": null,
  "wandb_job_type": "SF",
  "wandb_tags": [],
  "with_pbt": true,
  "pbt_mix_policies_in_one_env": true,
  "pbt_period_env_steps": 5000000,
  "pbt_start_mutation": 20000000,
  "pbt_replace_fraction": 0.3,
  "pbt_mutation_rate": 0.15,
  "pbt_replace_reward_gap": 0.1,
  "pbt_replace_reward_gap_absolute": 1e-06,
  "pbt_optimize_gamma": false,
  "pbt_target_objective": "true_objective",
  "pbt_perturb_min": 1.1,
  "pbt_perturb_max": 1.5,
  "command_line": "--train_dir=./train_dir --learning_rate=0.0001 --with_pbt=True --save_every_sec=60 --load_checkpoint_kind=best --save_best_every_sec=15 --use_rnn=False --seed=5 --num_envs_per_worker=28 --keep_checkpoints=1 --device=gpu --train_for_env_steps=985000000 --algo=APPO --experiment=0114-1156.2-62 --with_vtrace=False --experiment_summaries_interval=10 --save_best_after=20000000 --recurrence=1 --num_workers=32 --batch_size=1024 --env=Myrl-v0 --save_best_metric=7.ARGPB",
  "cli_args": {
    "algo": "APPO",
    "env": "Myrl-v0",
    "experiment": "0114-1156.2-62",
    "train_dir": "./train_dir",
    "device": "gpu",
    "seed": 5,
    "num_workers": 32,
    "num_envs_per_worker": 28,
    "batch_size": 1024,
    "recurrence": 1,
    "with_vtrace": false,
    "learning_rate": 0.0001,
    "experiment_summaries_interval": 10,
    "train_for_env_steps": 985000000,
    "save_every_sec": 60,
    "keep_checkpoints": 1,
    "load_checkpoint_kind": "best",
    "save_best_every_sec": 15,
    "save_best_metric": "7.ARGPB",
    "save_best_after": 20000000,
    "use_rnn": false,
    "with_pbt": true
  },
  "git_hash": "cf6f93c8109e48faf7bca746ce2184808f6513c1",
  "git_repo_name": "not a git repository",
  "train_script": "train_gym_env2"
}
alex-petrenko commented 1 year ago

You're encountering a general machine learning problem called "overfitting". Getting a model to generalize beyond its training distribution is a challenge in general, and it is not specific to RL or Sample Factory.

Some things to look at:

  1. Look up general anti-overfitting techniques from deep learning; dropout and a larger learning rate come to mind, although I haven't had much success with these.
  2. Domain randomization. Make sure your training distribution is as diverse as possible so it is harder to overfit; randomize parameters of the environment where possible. Check out automatic domain randomization ideas from this paper https://dextreme.org/ and the papers it references.
  3. Data augmentation. Making the training distribution larger always helps. Augment training scenarios to provide more data, and augment observations (e.g. for visual observations you can crop, change colors, flip images, and use other techniques from computer vision).
  4. Noise injection can help (i.e. injecting noise into observations and actions); see the sketch after this list.
  5. Adversarial learning and self-play can help if they are applicable to your setting.
  6. Use population-based training with a performance metric (true_objective) that is a proxy for generalization performance (i.e. performance on unseen data).
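
For item 4, here is a rough sketch of observation-noise injection as a gym wrapper (ObsNoiseWrapper and the 5% scale are just placeholders, not part of Sample Factory; the same idea works with gymnasium):

import gym
import numpy as np

class ObsNoiseWrapper(gym.ObservationWrapper):
    """Adds zero-mean Gaussian noise to vector observations (illustration only)."""

    def __init__(self, env, noise_scale=0.05):
        super().__init__(env)
        self.noise_scale = noise_scale

    def observation(self, obs):
        noise = np.random.normal(0.0, self.noise_scale, size=obs.shape).astype(obs.dtype)
        return obs + noise

# wrap the env inside your env factory before returning it, e.g.:
# env = ObsNoiseWrapper(base_env, noise_scale=0.05)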
jarlva commented 1 year ago

Thanks again for your reply, @alex-petrenko!

jarlva commented 1 year ago

I tried the following, but none of it worked. I'd like to try dropout; I noticed it can be applied in PyTorch, but I'm not sure how to do it in the SF2 code (maybe add an optional parameter?).

Update: I also tried editing sample-factory/tests/test_precheck.py, lines 15 and 18 (screenshot attached).

  - adding noise to observations, up to +/-5%
  - PBT
  - simplifying the model to 256,256
  - changing the LR to 0.00001 and 0.001, from the default 0.0001
  - increasing the data from 30k to 100k rows
  - augmenting the data is not possible in my case

jarlva commented 1 year ago

Hi @alex-petrenko, would it be possible to reply to my latest request from 2 days ago, above?

alex-petrenko commented 1 year ago

I think your best option is to implement a custom model (an encoder alone should be sufficient, but you can override the entire actor-critic module). See the documentation here: https://www.samplefactory.dev/03-customization/custom-models/

Just add dropout as a layer and, fingers crossed, it should work. Be careful about eval() and train() modes for your PyTorch module, but I think you're already covered here. See this thread for example: https://discuss.pytorch.org/t/if-my-model-has-dropout-do-i-have-to-alternate-between-model-eval-and-model-train-during-training/83007/2
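
For intuition, here is a tiny standalone demo of why train()/eval() matters for dropout (plain PyTorch, not Sample Factory code):

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity: dropout is disabled in eval mode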

alex-petrenko commented 1 year ago

Hmmm, I guess your confusion might come from the fact that Dropout can't just be added as a model layer; you have to actually call it explicitly in forward().

If I were you, I would simply modify the forward() method of the actor_critic class to call dropout when needed.
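
Very roughly, something like this (a hypothetical forward(); self.encoder and self.head are placeholders, not the actual actor_critic attributes):

import torch.nn.functional as F

def forward(self, obs):
    x = self.encoder(obs)
    # self.training is toggled by calling .train() / .eval() on the module
    x = F.dropout(x, p=0.1, training=self.training)
    return self.head(x)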

Sorry, I don't think I can properly help you without knowing the context and details of your problem. Overfitting is one of the hardest problems in all of ML, and there's no single magical recipe for fixing it.

jarlva commented 1 year ago

Hi @alex-petrenko, sorry, I'm not an expert at this. I'm using a customized cartpole-like gym env. Do you mean editing sample_factory/model/actor_critic.py as follows, at lines 154 and 184?

1/30 update: I also updated sample_factory/model/encoder.py, lines 216 and 221.

Also, would it make sense to add dropout as a switch option?

(screenshots attached)

alex-petrenko commented 1 year ago

The first thing I would try is adding dropout after each layer in the encoder. If you're using a cartpole-like environment, you would need to modify the MLP encoder, which is defined here: https://github.com/alex-petrenko/sample-factory/blob/86332022b489f9253cbaf8f71f8d49b47d765036/sample_factory/model/encoder.py#L72

The convolutional encoder probably has nothing to do with your task if your observations are just vectors of numbers; it is for image observations.
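
Concretely, the idea is to interleave Dropout into the MLP stack, something along these lines (just a sketch of the idea; the actual MLP-building code in encoder.py / model_utils.py may look different):

from torch import nn

def mlp_with_dropout(input_size, layer_sizes=(512, 512), p=0.1):
    # one Linear -> ELU -> Dropout block per hidden layer (illustration only)
    layers = []
    for size in layer_sizes:
        layers += [nn.Linear(input_size, size), nn.ELU(), nn.Dropout(p)]
        input_size = size
    return nn.Sequential(*layers)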

jarlva commented 1 year ago

I added it in model_utils.py, line 52. The resulting layers are:

RecursiveScriptModule(
  original_name=Sequential
  (0): RecursiveScriptModule(original_name=Linear)
  (1): RecursiveScriptModule(original_name=ELU)
  (2): RecursiveScriptModule(original_name=Dropout)
  (3): RecursiveScriptModule(original_name=Linear)
  (4): RecursiveScriptModule(original_name=ELU)
  (5): RecursiveScriptModule(original_name=Dropout)
)

But, alas, that still doesn't solve the overfitting...

(screenshot attached)

alex-petrenko commented 1 year ago

Dropout is one way to combat overfitting but it is not a panacea.

I'm sorry I can't help figure out your exact issue. As I said previously, overfitting is a general machine learning phenomenon, and most likely your problem has nothing to do with Sample Factory but rather with the overall problem formulation and approach.

jarlva commented 1 year ago

Hi @alex-petrenko, I understand. I appreciate the guidance and advice! Please let me know if you'd be open to advising for pay.

alex-petrenko commented 1 year ago

@jarlva not sure if this is realistic right now. I'm starting a full-time job very soon, which will keep me busy for the foreseeable future.

You said you're able to fit your training data, right? That is, the trained policy does well on the training data when you evaluate, but completely fails on out-of-distribution data?

If I could get some idea of what your environment is and what exactly the difference between your training and test data is, I could be more helpful. Maybe we can set up a call in ~2 weeks. Feel free to reach out via Discord DM or by email to discuss further.