These environments are all easy RL tasks, so you don't need a pre-trained expert; it's easy to train your own. The Stable Baselines PPO2 implementation should work for all of them. Hyperparameters given in the Zoo for similar environments should work OK, and there are also defaults in `run_mujoco.py`. Expect to run for up to 10M timesteps. Using `VecNormalize` may be important in some of these. Humanoid is the only one that might be a little tricky: I've had good luck with SAC, but I think PPO should be able to do it.
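For concreteness, here's a minimal sketch of that kind of training run with Stable Baselines 2.x. The hyperparameters are roughly the MuJoCo-style defaults and the environment/save paths are just examples, not our actual `data_collect` code:

```python
# Minimal sketch (not the exact data_collect code): training an expert with
# Stable Baselines 2.x PPO2 plus VecNormalize. Hyperparameters are roughly the
# run_mujoco.py-style MuJoCo defaults; env name and save paths are examples.
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

venv = DummyVecEnv([lambda: gym.make("HalfCheetah-v2")])
# Normalizes (and clips) observations and rewards -- important on many MuJoCo tasks.
venv = VecNormalize(venv)

model = PPO2(MlpPolicy, venv, n_steps=2048, nminibatches=32, noptepochs=10,
             ent_coef=0.0, learning_rate=3e-4, lam=0.95, gamma=0.99, verbose=1)
model.learn(total_timesteps=int(10e6))  # expect up to ~10M timesteps

model.save("halfcheetah_expert")
# The normalization statistics must be saved too, or rollouts from the loaded
# policy will see differently-scaled observations. (Exact API depends on the
# Stable Baselines version.)
venv.save_running_average(".")
```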
Of course openai/baselines will work fine too, but it seems harder to integrate with the rest of the codebase, so I'm not sure I see the advantage.
Note the Gym environments are all an older version, using MuJoCo 1.31. It's probably fine to use the new versions, just don't expect to match performance exactly.
There is also a GAIL implementation in both OpenAI Baselines and Stable Baselines, in addition to an open-source reference implementation of GAIL: http://github.com/openai/imitation. I think Baselines did some benchmarking before their GAIL PR was merged, so it may be worth looking at that too.
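For what it's worth, the Stable Baselines GAIL implementation is driven roughly like this (a sketch based on their documented 2.x API, with arbitrary file names and trajectory counts; not something we'd necessarily use directly):

```python
# Rough sketch of the Stable Baselines 2.x GAIL API, for comparison purposes.
from stable_baselines import GAIL, PPO2
from stable_baselines.gail import ExpertDataset, generate_expert_traj

# Train an expert and record 10 trajectories to expert_cartpole.npz.
expert = PPO2("MlpPolicy", "CartPole-v1", verbose=1)
generate_expert_traj(expert, "expert_cartpole", n_timesteps=int(1e5), n_episodes=10)

# Fit GAIL to the recorded demonstrations.
dataset = ExpertDataset(expert_path="expert_cartpole.npz", traj_limitation=10, verbose=1)
model = GAIL("MlpPolicy", "CartPole-v1", dataset, verbose=1)
model.learn(total_timesteps=int(1e6))
```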
> These environments are all easy RL tasks, you don't need to use a pre-trained expert, easy to train your own.
@qxcv and I have had a lot of trouble getting PPO2 to work on simple environments. I think I had to run CartPole 5 times on our Sacred settings to get an expert, and Sam couldn't get Pendulum to work last week, IIRC.
> Of course openai/baselines will work fine too, but it seems harder to integrate with the rest of the codebase, so I'm not sure I see the advantage.
I was thinking that after generating demonstrations in the correct format, I could just drop those into a `data/` folder or something and avoid adding experts to the repo.
I (and consequently Sam) mistakenly thought that OpenAI already had pickled expert models for MuJoCo tasks! I can only find the results of a cron job showing training curves in the `baselines` repo.
At this point stable-baselines training seems like the best solution.
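If it helps, rolling out a saved Stable Baselines policy into a `data/` directory could look roughly like the sketch below. The helper and the `.npz` field names are hypothetical; whatever format imitation actually expects would apply:

```python
# Hypothetical sketch: roll out a saved Stable Baselines expert and dump
# transitions into a data/ directory. The .npz field names are made up for
# illustration. Note: if the expert was trained with VecNormalize, the same
# observation normalization would need to be applied here.
import os
import numpy as np
import gym
from stable_baselines import PPO2

def collect_rollouts(model_path, env_id, n_episodes=50, out_dir="data"):
    model = PPO2.load(model_path)
    env = gym.make(env_id)
    obs_list, act_list, rew_list, done_list = [], [], [], []
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            next_obs, reward, done, _ = env.step(action)
            obs_list.append(obs)
            act_list.append(action)
            rew_list.append(reward)
            done_list.append(done)
            obs = next_obs
    os.makedirs(out_dir, exist_ok=True)
    np.savez(os.path.join(out_dir, "{}_demos.npz".format(env_id)),
             obs=np.array(obs_list), acts=np.array(act_list),
             rews=np.array(rew_list), dones=np.array(done_list))

collect_rollouts("halfcheetah_expert", "HalfCheetah-v2")
```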
OK, getting good experts might be something I can help with.
One thing missing in our `data_collect` script (I think I added a TODO) is `VecNormalize`, which is important for a lot of MuJoCo tasks. Our hyperparameters may also be off.
Of course RL is always somewhat high variance, so you may need to run it with a few random seeds.
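A simple way to handle that is to sweep a few seeds and keep whichever policy evaluates best, e.g. (hypothetical helper, assuming Stable Baselines 2.x):

```python
# Hypothetical seed sweep: train the same PPO2 config under several seeds and
# keep the policy with the highest mean evaluation return.
import gym
import numpy as np
from stable_baselines import PPO2

def mean_return(model, env_id, n_episodes=10):
    env = gym.make(env_id)
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return np.mean(returns)

best_model, best_score = None, -np.inf
for seed in range(3):
    model = PPO2("MlpPolicy", "CartPole-v0", seed=seed, verbose=0)
    model.learn(total_timesteps=int(1e5))
    score = mean_return(model, "CartPole-v0")
    if score > best_score:
        best_model, best_score = model, score
best_model.save("cartpole_expert")
```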
I just checked and things seem to work out of the box with the Stable Baselines `ppo2/run_mujoco.py` script. I made some modifications to save things and make a test runner; see this branch: https://github.com/HumanCompatibleAI/baselines/tree/mujoco-experts
The only differences I can see from `imitation.data_collect` are:

- `FeedForward32Policy`, a two-layer, 32-hidden-unit policy, with the policy and value networks sharing weights (see the sketch after this list). This could be important, especially in more complex environments. Surprised it'd be a problem in CartPole!
- `VecNormalize`. I've found this to be very important in the past, especially because it rescales and clips reward (which has the effect of changing the learning rate).
- Multiple parallel environments with `n_steps=256`, versus 1 environment with `n_steps=2048`. This shouldn't make much of a difference (Adam Stooke was able to scale the number of environments up to well beyond 64, IIRC, for PPO before performance degraded).
- `ent_coef=0.01` (default) rather than `ent_coef=0.00`. Could be important in more complex environments.

I've addressed these differences in #57, so we should be able to perform the training in our repo. Running the tests now and will report when I have results. Ideally we'd do an ablation to figure out which of these changes is important, but I'm not planning on doing that personally.
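For context, a two-layer 32-unit shared policy in Stable Baselines 2.x looks roughly like this (a sketch of what `FeedForward32Policy` presumably does; the real class in imitation may differ):

```python
# Sketch of a FeedForward32Policy-style policy for Stable Baselines 2.x:
# two shared hidden layers of 32 units feeding both the policy and value heads.
from stable_baselines.common.policies import FeedForwardPolicy

class FeedForward32Policy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        # net_arch=[32, 32] with no pi/vf split means the layers are shared.
        super().__init__(*args, net_arch=[32, 32],
                         feature_extraction="mlp", **kwargs)
```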
> I was thinking that after generating demonstrations in the correct format, I could just drop those into a data/ folder or something and avoid adding experts to the repo.
Let's avoid committing any binary files (rollouts or experts) to the repo. It's OK to have one or two just for unit tests, but if we add lots, it will notably increase the size of the Git repository and slow everything down. And then we'll inevitably have to change the format at some point and recommit everything, but the old ones will remain versioned forever, and so on.
Let's instead store them in S3 and add a script to sync them. If we absolutely must use Git, then let's at least use Git LFS.
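Something as simple as a thin wrapper over the AWS CLI would probably do (the bucket name below is a placeholder):

```python
# Hypothetical sync script: mirror the expert/rollout data between S3 and a
# local data/ directory. Requires the AWS CLI to be installed and configured.
import subprocess

S3_PREFIX = "s3://example-bucket/expert-demos"  # placeholder bucket/prefix
LOCAL_DIR = "data/expert-demos"

def sync_down():
    subprocess.check_call(["aws", "s3", "sync", S3_PREFIX, LOCAL_DIR])

def sync_up():
    subprocess.check_call(["aws", "s3", "sync", LOCAL_DIR, S3_PREFIX])

if __name__ == "__main__":
    sync_down()
```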
Here are results from the Stable Baselines `ppo2/run_mujoco.py` script for 1 million timesteps with default hyperparameters (final `ep_reward_mean` for each environment and seed):
| Environment     | seed 0   | seed 1   | seed 2   |
|-----------------|----------|----------|----------|
| Acrobot-v1      | -70.6    | -72.9    | -75.6    |
| Ant-v2          | 1.58e+03 | 756      | 764      |
| CartPole-v0     | 199      | 200      | 199      |
| HalfCheetah-v2  | 1.45e+03 | 1.47e+03 | 1.45e+03 |
| Hopper-v2       | 2.34e+03 | 2.33e+03 | 2.83e+03 |
| Humanoid-v2     | 701      | 600      | 494      |
| MountainCar-v0  | -113     | -106     | -100     |
| Reacher-v2      | -5.48    | -5.41    | -6.99    |
Eyeballing them, the results look pretty close to those reported in the PPO paper; some differences are to be expected given the Gym/MuJoCo version change. Ant and Humanoid (which the PPO paper does not report on) seem underwhelming; however, they would likely benefit from being trained for more than a million timesteps. See the SAC paper for SOTA results and for benchmarks of PPO over longer training runs (their PPO gets up to 6000 for Humanoid and 2000 for Ant).
Full logs and final policies attached.
@shwang I actually got Humanoid and Ant working pretty well just by increasing the batch size to 8*2048 = 16384, but it seems I forgot to ever post about this. Opened #66.
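For reference, that larger-batch setting amounts to something like the following (a sketch assuming Stable Baselines 2.x; 8 parallel environments with `n_steps=2048` gives the 16384 batch):

```python
# Sketch of the larger-batch PPO2 run: 8 parallel environments with
# n_steps=2048, i.e. 8 * 2048 = 16384 transitions per update.
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv, VecNormalize

venv = SubprocVecEnv([lambda: gym.make("Humanoid-v2") for _ in range(8)])
venv = VecNormalize(venv)

model = PPO2("MlpPolicy", venv, n_steps=2048, nminibatches=32, noptepochs=10,
             ent_coef=0.0, learning_rate=3e-4, verbose=1)
model.learn(total_timesteps=int(10e6))
```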
Closed by #100
The GAIL paper (Appendix B) reported benchmarks for the following environments:
To compare our GAIL implementation against the paper (presuming that this is the comparison that we want to do in the first place), we need expert demonstrations for all of these environments. Unfortunately, the Stable Baselines Zoo uses PyBullet/Roboschool for all the robotics tasks, so we only have Stable Baselines experts for the first 3 environments in the table above.
@qxcv and I are thinking of generating demonstrations from openai/baselines experts and confirming that they achieve the same reward as reported in the GAIL paper. Any thoughts, @AdamGleave?