HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Benchmark and replicate algorithm performance #388

Open AdamGleave opened 2 years ago

AdamGleave commented 2 years ago

Tune hyperparameters, match implementation details, and fix bugs until we replicate the performance of reference implementations of the algorithms. I'm not concerned about an exact match -- if we do about as well on average, better on some environments and worse on others, that seems OK.

Concretely, we should test BC, AIRL, GAIL, DRLHP, and DAgger on at least the seals versions of CartPole, MountainCar, HalfCheetah, and Hopper.

Baselines: paper results are the first port of call. But some paper results are confounded by different environment versions, especially fixed vs. variable horizon. SB2 GAIL is a good sanity check. If need be, reference implementations of most other algorithms exist, but they can be hard to run.
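The comparison we want is roughly: evaluate each learned policy on each seals task and compare its mean return to the reference (expert or published) score. A minimal sketch of that loop, where `load_trained_policy` and the evaluation settings are placeholders rather than part of imitation's API:

```python
"""Sketch of the benchmark comparison: evaluate each trained policy on the
seals tasks and report mean return. `load_trained_policy` is a placeholder."""
import gym
import seals  # noqa: F401  -- registers the seals/* environment IDs

from stable_baselines3.common.evaluation import evaluate_policy

ENV_IDS = [
    "seals/CartPole-v0",
    "seals/MountainCar-v0",
    "seals/HalfCheetah-v0",
    "seals/Hopper-v0",
]
ALGOS = ["bc", "dagger", "gail", "airl", "drlhp"]


def load_trained_policy(algo: str, env_id: str):
    """Placeholder: load a policy produced by the corresponding training script."""
    raise NotImplementedError


for env_id in ENV_IDS:
    env = gym.make(env_id)
    for algo in ALGOS:
        policy = load_trained_policy(algo, env_id)
        mean_ret, std_ret = evaluate_policy(policy, env, n_eval_episodes=50)
        print(f"{algo:8s} {env_id:25s} {mean_ret:8.1f} +/- {std_ret:.1f}")
```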

AdamGleave commented 2 years ago

One useful tool might be airspeed velocity (asv) to keep track of metrics over time.
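asv records the value returned by any function whose name starts with `track_` for each benchmarked commit and plots it over time. A minimal sketch of how that could be wired up for us (the module layout and helper below are assumptions, not existing project code):

```python
# benchmarks/track_imitation.py -- hypothetical asv benchmark module.
# asv discovers functions prefixed with `track_` and records the returned
# value for each benchmarked commit, so regressions in learned-policy
# performance would show up in asv's plots over time.


def _gail_half_cheetah_mean_return() -> float:
    """Placeholder: train or load a GAIL policy on seals/HalfCheetah-v0 and
    return its mean episode return over a fixed number of evaluation episodes."""
    raise NotImplementedError


def track_gail_half_cheetah_return() -> float:
    return _gail_half_cheetah_mean_return()


# asv uses this attribute as the y-axis label in its plots.
track_gail_half_cheetah_return.unit = "mean episode return"
```

`asv.conf.json` would then point `benchmark_dir` at `benchmarks/` so that `asv run` picks this up.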

AdamGleave commented 2 years ago

@taufeeque9 has been working on this, but it's a big enough task that it might make sense to split it up (e.g. you each take some subset of the algorithms and work together on building out a useful pipeline).

I think this is still a blocker for v1. At the least, we don't want to actively start promoting the code until we're confident the algorithms are all performing adequately.

ernestum commented 1 year ago

I am trying to figure out the current state of the benchmarks.

What I could figure out on my own:

What other steps are required before we can close this issue? Which of those steps could I help with?

Rocamonde commented 1 year ago

Things that could be done that I think would be useful (but not necessarily what you should do):

In terms of benchmarking:

In terms of the codebase:

AdamGleave commented 1 year ago

@ernestum benchmarking/util.py cleans up the configs generated by automatic hyperparameter tuning; the output of this is the JSON config files in benchmarking/.
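As a rough illustration of the kind of clean-up step described here (this is not the actual contents of benchmarking/util.py, and the key names are made up):

```python
"""Sketch of cleaning a tuned config into a benchmark config: drop
run-specific entries and write the rest out as JSON. Illustrative only."""
import json
from pathlib import Path

RUN_SPECIFIC_KEYS = {"seed", "log_dir", "logging"}  # assumed, not exhaustive


def clean_tuned_config(raw_config: dict) -> dict:
    """Return a copy of the tuned config with run-specific entries removed."""
    return {k: v for k, v in raw_config.items() if k not in RUN_SPECIFIC_KEYS}


def write_benchmark_config(raw_path: Path, out_dir: Path) -> Path:
    raw_config = json.loads(raw_path.read_text())
    cleaned = clean_tuned_config(raw_config)
    out_path = out_dir / raw_path.name
    out_path.write_text(json.dumps(cleaned, indent=2, sort_keys=True))
    return out_path
```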

My understanding of the outstanding issues is:

I don't think being able to run benchmarks without cloning the repo is that important; this is primarily something developers would want to run.

ernestum commented 1 year ago

OK, to me it looks like the primary thing I could contribute here is making the execution of benchmarks a bit more future-proof:

Is there interest in this or is that a lower priority thing?

AdamGleave commented 1 year ago

> Is there interest in this or is that a lower priority thing?

Cleaning up and documenting util.py seems worthwhile.

Documenting how to generate the summary table is also worthwhile, although I'm not that happy with the current workflow: imitation.scripts.analyze seems overly complex for what it does, so you might want to be on the lookout for ways to improve the table generation.
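For the table generation, one possible lighter-weight direction is sketched below; it assumes the benchmark runs have already been flattened into a CSV with `algo`, `env_id`, and `return_mean` columns -- the file name and column names are assumptions, not what the scripts currently produce.

```python
"""Sketch of a lighter-weight summary table than imitation.scripts.analyze.
Assumes one row per run with `algo`, `env_id`, and `return_mean` columns."""
import pandas as pd

runs = pd.read_csv("benchmark_runs.csv")  # hypothetical aggregated results file
table = runs.pivot_table(
    index="algo", columns="env_id", values="return_mean", aggfunc="mean"
).round(1)
# GitHub-flavoured Markdown is convenient to paste into issues and PRs.
print(table.to_markdown())
```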