shwang closed this issue 3 years ago
@AdamGleave I think it will be easier to benchmark adamgleave/inverse_rl if I convert the scripts/ into Sacred format. Does this sound about right?
I think this conversion should be pretty quick because for the most part I will just be copying Sacred code over from this repo.
edit: Also, I would appreciate any thoughts on the general plan.
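For concreteness, a minimal Sacred wrapper for one of the scripts might look something like this (the experiment name, config parameters, and env name below are all illustrative, not the actual config of either repo):

```python
from sacred import Experiment

ex = Experiment("airl_benchmark")  # hypothetical experiment name


@ex.config
def config():
    # Illustrative parameters only; the real config should mirror the
    # options the existing scripts/ entry points expose via argparse.
    env_name = "Ant-v2"
    total_timesteps = 1_000_000


@ex.automain
def main(env_name, total_timesteps):
    # Here the wrapper would call into the existing training code,
    # passing the Sacred-managed config values through.
    print(f"Training on {env_name} for {total_timesteps} steps")
```

Sacred then handles command-line config overrides (e.g. `python run.py with env_name=HalfCheetah-v2`) and logs the config with each run, which should make the two codebases' results easier to line up.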
I think the reward obtained by RL algorithms varies a fair bit between some of the old and new versions of the environments, so I'd be concerned about making a direct comparison. If possible, I'd suggest running adamgleave/inverse_rl on the modern environments. If not possible, then compare in terms of % of expert performance, not absolute numbers.
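The % of expert performance comparison can be computed like this (baselining against a random policy's return is a common convention, not something specified in this thread, but it matters when returns can be negative):

```python
def pct_of_expert(learner_return, expert_return, random_return=0.0):
    """Learner return as a fraction of expert return.

    Optionally subtracts a random-policy baseline from both, so that
    0.0 = random-level and 1.0 = expert-level performance.
    """
    return (learner_return - random_return) / (expert_return - random_return)
```

For example, a learner scoring 0 on an env where the expert scores 100 and a random policy scores -100 is at 50% of expert performance, even though its absolute return is zero.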
I like the idea of a transfer learning test. I'd expect this to be quite fiddly to get working, though. I think the results in the paper required a fair bit of hyperparameter tuning, and tricks like early stopping. Still worth doing, but don't assume there's anything wrong if you can't reproduce them.
Adding Sacred support should be fairly easy. Nice if the config parameters can line up fairly closely, you might be able to use one bash script to run both sets of tests, just switch which Python script you're calling. I don't have an opinion on whether that'll be overall easier or not. Parsing the log output and getting that in a consistent format is probably going to be the harder part.
This issue looks stale; should we close?
To benchmark AIRL, I'm planning to compare the imitation performance of our new AIRL implementation on modern gym envs against the performance of the old AIRL implementation (on adamgleave/inverse_rl) on old+analogous gym envs.
I would like to compare the following envs (against their old+analogous equivalents):
In addition, I'm planning to test "self-transfer learning" performance on all the environments. This entails training AIRL on the expert demonstrations, then using the reward learned by AIRL to train a new policy on the same environment, and finally evaluating the average return of that new policy.
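The self-transfer protocol described above can be sketched as the following three-step loop (train_airl, train_rl, and rollout_return are hypothetical placeholder names standing in for the real imitation/inverse_rl entry points; the stubs here just make the sketch self-contained):

```python
def train_airl(env_name, demos):
    """Placeholder: run AIRL on expert demos, return the learned reward fn."""
    return lambda obs, act: 0.0  # stub


def train_rl(env_name, reward_fn):
    """Placeholder: train a fresh RL policy against reward_fn."""
    return lambda obs: 0  # stub


def rollout_return(env_name, policy, n_episodes=10):
    """Placeholder: average undiscounted return of policy over n_episodes."""
    return 0.0  # stub


def self_transfer_eval(env_name, demos):
    reward_fn = train_airl(env_name, demos)     # 1. learn reward via AIRL
    new_policy = train_rl(env_name, reward_fn)  # 2. re-train RL on the learned reward
    return rollout_return(env_name, new_policy) # 3. evaluate the new policy
```

A high return here indicates the learned reward is good enough to train a competent policy from scratch on the same environment; the transfer test then swaps in a modified environment at step 2.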
Finally, I'm planning on testing transfer learning performance for the TwoDMaze and Ant environments (like in the AIRL paper).