HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

AIRL benchmarking against adamgleave/inverse_rl #70

Closed: shwang closed this issue 3 years ago

shwang commented 5 years ago

To benchmark AIRL, I’m planning to compare the imitation performance of our new AIRL implementation on modern Gym envs against the performance of the old AIRL implementation (in adamgleave/inverse_rl) on the old, analogous Gym envs.

I would like to compare the following envs (against their old, analogous equivalents):

Standard:
CartPole-v1
MountainCar-v0
Acrobot-v1
HalfCheetah-v2
Hopper-v2
Walker2d-v2
Ant-v2
Humanoid-v2
Reacher-v2

Custom:
imitation/TwoDMaze-v0
imitation/CustomAnt-v0  # Ant with lower gear ratio (makes flipping over less likely).
imitation/DisabledAnt-v0  # Ant with lower gear ratio and two shorter legs.

In addition, I’m planning to test "self-transfer" performance on all of the environments. This entails training AIRL on the expert demonstrations, then using the reward learned by AIRL to train a new policy on the same environment, and finally evaluating the average return of that new policy.

Finally, I'm planning to test transfer-learning performance for the TwoDMaze and Ant environments (as in the AIRL paper).
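For concreteness, here is a rough sketch of what the self-transfer check looks like, written against the present-day imitation / stable-baselines3 APIs (which postdate this issue); the class names, hyperparameters, environment ID, and the briefly-trained stand-in "expert" are all assumptions rather than what the actual benchmark scripts did. The transfer test follows the same pattern, except the learned reward is re-optimised in a different environment (e.g. train on CustomAnt, re-optimise on DisabledAnt).

```python
"""Sketch of the self-transfer evaluation: train AIRL on expert demos, then
re-optimise a fresh policy against the frozen learned reward and measure its
return under the true environment reward.  Assumes the current imitation API;
names and hyperparameters are placeholders."""
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

from imitation.algorithms.adversarial.airl import AIRL
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.rewards.reward_nets import BasicShapedRewardNet
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env

SEED = 0
rng = np.random.default_rng(SEED)
# "HalfCheetah-v2" matches the list above; newer Gymnasium installs use -v4/-v5.
venv = make_vec_env(
    "HalfCheetah-v2",
    rng=rng,
    n_envs=8,
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],
)

# Stand-in expert demonstrations: the real benchmark would load demos from a
# well-trained expert rather than this briefly-trained PPO policy.
expert = PPO("MlpPolicy", venv, seed=SEED)
expert.learn(10_000)
demos = rollout.rollout(
    expert, venv, rollout.make_sample_until(min_episodes=60), rng=rng
)

# 1) Train AIRL on the demonstrations to recover a reward function.
reward_net = BasicShapedRewardNet(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    normalize_input_layer=RunningNorm,
)
airl_trainer = AIRL(
    demonstrations=demos,
    demo_batch_size=1024,
    venv=venv,
    gen_algo=PPO("MlpPolicy", venv, seed=SEED),
    reward_net=reward_net,
)
airl_trainer.train(int(1e6))

# 2) Self-transfer: re-optimise a fresh policy against the frozen learned
#    reward.  For the transfer test, wrap a *different* env here instead.
learned_reward_venv = RewardVecEnvWrapper(venv, reward_net.predict)
new_policy = PPO("MlpPolicy", learned_reward_venv, seed=SEED)
new_policy.learn(int(1e6))

# 3) Evaluate the re-trained policy under the true environment reward.
mean_return, _ = evaluate_policy(new_policy, venv, n_eval_episodes=20)
print(f"self-transfer mean return: {mean_return:.1f}")
```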

shwang commented 5 years ago

@AdamGleave I think it will be easier to benchmark adamgleave/inverse_rl if I convert its scripts/ to the Sacred format. Does that sound about right?

I think this conversion should be pretty quick because for the most part I will just be copying Sacred code over from this repo.

Edit: I'd also appreciate any thoughts on the general plan above.
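For anyone following along, the conversion is mostly boilerplate along the lines below. This is a generic Sacred skeleton, not the actual inverse_rl or imitation configs; the experiment name, config fields, and run_airl() body are placeholders.

```python
"""Generic Sacred skeleton of the kind the conversion would add.  The
experiment name, config fields, and run_airl() body are placeholders, not
real adamgleave/inverse_rl entry points."""
from sacred import Experiment

ex = Experiment("airl_benchmark")


@ex.config
def config():
    env_name = "CartPole-v1"    # which environment to benchmark
    total_timesteps = int(1e6)  # training budget
    seed = 0


@ex.automain
def run_airl(env_name, total_timesteps, seed):
    # ... call into the existing training code here ...
    return {"env_name": env_name, "mean_return": None}
```

Individual runs can then be launched with Sacred's CLI overrides, e.g. `python run_airl.py with env_name=Hopper-v2 seed=3`.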

AdamGleave commented 5 years ago

I think the reward obtained by RL algorithms varies a fair bit between some of the old and new versions of these environments, so I'd be concerned about making a direct comparison. If possible, I'd suggest running adamgleave/inverse_rl on the modern environments. If that's not possible, then compare in terms of % of expert performance, not absolute numbers.
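If absolute returns aren't comparable across environment versions, a normalized score is the safer metric. A minimal sketch; the optional random-policy baseline term is my addition, not something specified in this thread:

```python
def pct_of_expert(learner_return: float, expert_return: float,
                  random_return: float = 0.0) -> float:
    """Learner performance as a fraction of expert performance.

    Subtracting a random-policy baseline (an assumption, not specified in the
    thread) avoids inflating scores in envs whose default reward is far from
    zero.  E.g. pct_of_expert(2500, 3000) == 0.8333...
    """
    return (learner_return - random_return) / (expert_return - random_return)
```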

I like the idea of a transfer learning test, though I'd expect it to be quite fiddly to get working. I think the results in the paper required a fair bit of hyperparameter tuning and tricks like early stopping. It's still worth doing, but don't assume there's anything wrong if you can't reproduce them.

Adding Sacred support should be fairly easy. It would be nice if the config parameters lined up fairly closely: then you might be able to use one bash script to run both sets of tests, just switching which Python script you're calling. I don't have an opinion on whether that'll be easier overall. Parsing the log output and getting it into a consistent format is probably going to be the harder part.
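One way to realize the "single driver, two backends" idea, sketched in Python rather than bash; the script paths and the Sacred-style `with env_name=...` override are placeholders that assume the conversion described above, not existing entry points in either repo:

```python
"""Toy driver that runs the same benchmark matrix against either codebase.
Script paths and CLI syntax are placeholder assumptions."""
import subprocess

SCRIPTS = {
    "imitation": "scripts/train_airl.py",       # placeholder path
    "inverse_rl": "scripts/run_inverse_rl.py",  # placeholder path
}
ENVS = ["CartPole-v1", "HalfCheetah-v2", "Hopper-v2"]


def run_all(backend: str) -> None:
    for env_name in ENVS:
        # Sacred lets both scripts share the same override syntax.
        subprocess.run(
            ["python", SCRIPTS[backend], "with", f"env_name={env_name}"],
            check=True,
        )


if __name__ == "__main__":
    for backend in SCRIPTS:
        run_all(backend)
```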

AdamGleave commented 4 years ago

This issue looks stale; should we close it?