Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Clarification/Question about pretraining with RL and BC in Unity #2943

Closed: mgb249 closed this issue 4 years ago

mgb249 commented 4 years ago

Hi,

I was testing offline_bc_config.yaml to use behavioral cloning for a fairly simple task, and it worked. I then tried to replicate those results using the pretraining section in trainer_config (PPO) and sac_config (soft actor-critic), setting the pretraining steps to 0 (which, according to the documentation, makes it run just BC) to essentially disable the RL and verify that it was performing behavioral cloning. However, neither converged. Is the loss calculated the same way in the BC model and in pretraining, or am I doing something improper in how I'm formatting the yaml files?

Below are the configurations.

For offline_bc:

default:
    trainer: offline_bc
    batch_size: 64
    summary_freq: 1000
    max_steps: 5.0e4
    batches_per_epoch: 10
    use_recurrent: false
    hidden_units: 10
    learning_rate: .001
    num_layers: 1
    sequence_length: 32
    memory_size: 256
    demo_path: ./demos/TargetBallChase.demo

For SAC (with pretraining):

default:
    trainer: sac
    batch_size: 64
    buffer_size: 12800
    buffer_init_steps: 0
    hidden_units: 10
    init_entcoef: 1.0
    learning_rate: .001
    learning_rate_schedule: constant
    max_steps: 5.0e4
    memory_size: 256
    num_update: 1
    train_interval: 1
    num_layers: 1
    time_horizon: 64
    sequence_length: 64
    summary_freq: 1000
    tau: 0.005
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 0.9
        gamma: 0.01
    pretraining:
      demo_path: ./demos/manualgod.demo
      strength: .7
      steps: 0

Also, as a follow-up: for behavioral cloning, once the demo file is created, is training completely independent of the environment (i.e. is it just training on the input/output examples in the demo file itself)?

Regards,

Michael

robinerd commented 4 years ago

Regarding your last question, I have recently seen some strange behaviour where agents use the camera in the scene even though I would expect the visual observations to just be stored in the demo file (and hence be independent of the environment while training). I filed an issue (#2944) and will await clarification there :) Also subscribing to this in case you find something! :+1:

EDIT: Forgot to ask, do you also use visual observations?

mgb249 commented 4 years ago

I did not use visual observations, so it's possible those inputs are being treated differently. Anyway, hopefully they'll answer my question, or yours :).

If I don't hear back in a few days, I might just bite the bullet and run a debugger on the Python code to see what is getting passed to TensorFlow. :/

ervteng commented 4 years ago

Hi @mgb249, setting the pretraining_steps to 0 doesn't disable the RL - you also have to set the reward strength of your extrinsic reward to 0. We're working on combining BC and pretraining into one feature - stay tuned!
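For example, in your SAC config above that would look roughly like this (only the reward_signals and pretraining sections are shown; the other values are just carried over from your config):

    reward_signals:
      extrinsic:
        strength: 0.0                       # turn off the RL reward so only the BC update drives learning
        gamma: 0.01
    pretraining:
      demo_path: ./demos/manualgod.demo
      strength: .7
      steps: 0                              # 0 should keep pretraining active for the entire run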

mgb249 commented 4 years ago

Hi Ervin,

Thanks! I'll test it out.

mgb249 commented 4 years ago


Also, about my second question, just to clarify: is it ONLY using the demo file (in both BC and pretraining), i.e. the environment is not being used?

ervteng commented 4 years ago

It is using the environment to do inference and collect the rewards for reporting to Tensorboard (and so you can see what the agent is up to!). Currently there's no way to run ml-agents without the environment, but technically speaking the environment isn't being used to learn.

mgb249 commented 4 years ago

@ervteng Hi, thanks for the clarification.

So I ended up testing RL with pretraining as you specified, and there are some weird effects. Firstly, SAC does use pretraining, but only when I specify a number of steps > 0, and even then it is not converging (which makes me wonder whether some RL parameter is somehow affecting the weight updates during pretraining??). For PPO, it just doesn't do BC regardless of what I set :/

If you have an example config file/setting or scene where you've verified that pretraining works for PPO and SAC, could you share it? I'm wondering if some setting I'm using is unintentionally causing PPO not to use pretraining. Likewise, for SAC at least, I'm pretty befuddled as to why I'm getting such divergent performance. If you have two example configs that produce equivalent behavior with SAC pretraining and offline BC, that would be great!

The only thing of note is that the version I'm using, according to the git log, is 0.10.1 (I believe it was pulled on October 21st).

I'll try a few more variations over the weekend.

I really appreciate the help!

Best,

Michael

mgb249 commented 4 years ago

Hi,

I just wanted to follow up on my last question. Over the weekend we tried to validate that the results differ with regard to behavioral cloning. I trained 10 models in each of three conditions (30 in total): Pretrain-SAC (ONLY using pretraining), Pretrain-PPO (ONLY using pretraining), and BC. PPO failed to ever get a reward and, based on run-time speed, it didn't even use the pretraining; SAC failed to hit the reward, although observationally it did appear to do some non-random movement; and BC consistently got a reward.

The task is fairly simple, and again I wanted to validate that pretraining works the same way as BC, assuming the same hyperparameters. If you have an example config file/setting or scene where you've verified that pretraining works for PPO and SAC and gets the same results, could you share it?

At this point I might just bite the bullet and go through the Python code to see how the loss and weight updates are calculated in BC versus the pretraining component of the RL models.

Regards,

Michael

ervteng commented 4 years ago

Hi @mgb249, the loss function is actually the same for pretraining and BC - it's the same update, only performed on a model that's also used for PPO.
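Roughly speaking (and simplifying a bit), for continuous actions both paths minimize a supervised loss of the form

$$\mathcal{L}_{\text{BC}} = \mathbb{E}_{(s,\,a^{*}) \sim \mathcal{D}}\big[\lVert \pi_\theta(s) - a^{*} \rVert^{2}\big],$$

where $\mathcal{D}$ is the demonstration buffer and $a^{*}$ is the demonstrated action; the discrete-action case uses a cross-entropy loss instead, and the pretraining strength roughly controls how strongly that update is allowed to influence the policy.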

When you say you only used pretraining, what are the hyperparameters? Hopefully I can recreate your issue on our side. For instance, on our example environment Hallway, these two configs produce roughly the same result:

Hallway:
    trainer:    ppo
    batch_size: 128
    beta:   0.01
    buffer_size:    1024
    epsilon:    0.2
    hidden_units:   128
    lambd:  0.95
    learning_rate:  0.0003
    max_steps:  5.0e5
    memory_size:    256
    normalize:  False
    num_epoch:  3
    num_layers: 2
    time_horizon:   64
    sequence_length:    64
    summary_freq:   1000
    use_recurrent:  True
    reward_signals: 
      extrinsic:    
        strength:   0.0
        gamma:  0.99
    summary_path:   ./summaries/pretraininghallway_Hallway
    model_path: ./models/pretraininghallway-0/Hallway
    keep_checkpoints:   5
    pretraining:    
      demo_path:    ./demos/ExpertHallway.demo
      strength: 1.0
      steps:    1000000
Hallway:
    trainer: offline_bc
    max_steps: 5.0e5
    num_epoch: 5
    batch_size: 64
    batches_per_epoch: 5
    num_layers: 2
    hidden_units: 128
    sequence_length: 16
    use_recurrent: true
    memory_size: 256
    sequence_length: 32
    demo_path: ./demos/ExpertHallway.demo

Anyways, keep me posted on the experiment. We're considering combining the offline BC and pretraining features, but want to make sure there's no degradation in performance - thank you for trying them out.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 14 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had activity in the last 28 days. If this issue is still valid, please ping a maintainer. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.