HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Consistent interface for algorithms & single training script #79

Open AdamGleave opened 5 years ago

AdamGleave commented 5 years ago

Currently we don't have any CLI script for behavioral cloning or the density baseline. I envisage this codebase as being particularly useful for rapidly benchmarking against a wide variety of algorithms. For this to be possible, we need (as far as is possible) a consistent interface between the different imitation learning algorithms, similar to how Stable Baselines exposes a common interface for all of its RL algorithms. Ideally we'd then also have imitation.scripts.train work for all of them, although there will clearly need to be some algorithm-specific configs.

shwang commented 3 years ago

We now have a SB3-like .train(total_timesteps) interface for GAIL, AIRL, DAgger, and BC. We also have Sacred scripts for each of these algorithms, though they are not unified into a single train script.

AdamGleave commented 1 year ago

We've made a lot of progress since this issue was first opened but there's still room for improvement so I'm leaving this open for now.

ernestum commented 1 year ago

I think this issue needs to be resolved before we can address #587 and start a new take on #602. I reviewed the scripts and these are some points we should address:

This is the current relationship between scripts and algorithms:

| Algo | Script | Comments |
| --- | --- | --- |
| DAgger, BC | `train_imitation` | BC is a "hidden feature", enabled by setting `use_dagger=False`. |
| MCE IRL | - | Do we need a script for this? |
| AIRL, GAIL | `train_adversarial` | Throwing them together into one script results in lots of weird constructs, such as an `algorithm_specific` configuration that is merged into the main config using dubious hacks, and configuration being modified at runtime. |
| PreferenceComparisons | `train_preference_comparisons` | - |

There is a reward ingredient that should probably be called reward_net. It takes care of edge cases for AIRL, which I think is a bad idea. It also has some named configs that only enable/disable individual configuration values. I think this is something that should be handled by the config_hook instead.
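For illustration, a minimal sketch of how a config hook on the ingredient could absorb those switches (the ingredient, field, and command names here are hypothetical, not the actual ones in imitation):

```python
from sacred import Ingredient

# Hypothetical reward-network ingredient; all names are illustrative only.
reward_net_ingredient = Ingredient("reward_net")


@reward_net_ingredient.config
def default_config():
    net_cls = "BasicRewardNet"  # which reward network class to build
    normalize_output = True  # whether to normalize predicted rewards


@reward_net_ingredient.config_hook
def adjust_for_command(config, command_name, logger):
    """Adjust defaults based on the command being run, instead of encoding
    AIRL-specific edge cases as named configs. The returned dict is used to
    update the ingredient's configuration."""
    updates = {}
    if command_name == "airl":  # hypothetical command name
        updates["net_cls"] = "BasicShapedRewardNet"
    return updates
```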

There is a train ingredient which should be split up into a policy ingredient and an evaluation ingredient. Ingredients should be named after objects, not verbs.

We should probably have an ingredient for each algorithm. This makes constructing more complex experiments (such as warm-starting) easier. I would also vote for one train_<algo> script per algorithm, to follow the Unix philosophy that each tool should do only one job.

AdamGleave commented 1 year ago

> We should probably have an ingredient for each algorithm. This makes constructing more complex experiments (such as warm-starting) easier.

Note sacred.Experiment is-a sacred.Ingredient (it is the parent class), so I think we already have this more or less? EDIT: After seeing #663 I do think there's a case for making BC an ingredient if we want to use it to warm-start a variety of other algorithms, and because it already has a compositional relationship with DAgger. BC might be a bit of a special case in that regard, though.
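As a rough sketch of how that could look with Sacred's ingredient composition (hypothetical names, not existing code in imitation):

```python
from sacred import Experiment, Ingredient

# Hypothetical BC ingredient that other experiments could reuse.
bc_ingredient = Ingredient("bc")


@bc_ingredient.config
def bc_config():
    batch_size = 32
    n_epochs = 10


@bc_ingredient.capture
def train_bc(batch_size, n_epochs):
    # Placeholder for the actual BC training loop.
    print(f"BC: batch_size={batch_size}, n_epochs={n_epochs}")
    return "bc_policy"  # stand-in for the trained policy


# A DAgger experiment composed from the BC ingredient. Since Experiment
# subclasses Ingredient, the same pattern nests into larger experiments.
dagger_ex = Experiment("train_dagger", ingredients=[bc_ingredient])


@dagger_ex.automain
def main():
    policy = train_bc()  # warm-start DAgger from a BC-trained policy
    # ... continue DAgger training from `policy` ...
```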

> I would also vote for one train_<algo> script per algorithm, to follow the Unix philosophy that each tool should do only one job.

I feel torn here. It's nice to be able to run benchmarks with a variety of algorithms using a consistent interface, and having one script with different subcommands seems more natural for that. You could split it up into different scripts, but it seems like a bit of a nightmare to maintain a consistent interface across many files, although perhaps doable if we can factor most things out into ingredients?

However, the algorithms do have fundamental properties different enough that we can't literally have the same interface for all of them. The current approach was a compromise, grouping similar algorithms together: e.g. AIRL/GAIL are almost identical except for the parametric form of the reward network. But I agree the way they (and BC/DAgger) are currently merged is pretty ad hoc.

Before we do a complete rewrite of the scripts, I do think it's worth reflecting on whether we want to keep building on top of Sacred. I do like it, but the project is sadly basically abandonware at this point.

ernestum commented 1 year ago

Good point that each Experiment is actually an Ingredient. I will double-check where it makes sense to turn algorithms into ingredients. It would be pointless to have e.g. a DAggerExperiment that only has a single DAggerIngredient.

IMO the consistent interface should result from a consistent design of the ingredients and their configuration. The scripts just pull the ingredients together to form experiments. They allow you to run an experiment with the given ingredients, but also to explore what configuration values it needs. When I run python -m imitation.scripts.<some_script> print_config I expect it to only show me the configuration specific to what that script does. E.g. right now we use train_imitation to train BC as well as DAgger. If I am only interested in training BC, then I don't want to be bothered with the configuration values for DAgger in the print_config output.

Throwing AIRL and GAIL together eventually resulted in some weird design choices (see the comments in the table above). It is tempting to fuse them since they are so similar, but I feel the decision to generalize here (a good idea on its own) required so many hacks that they outweighed the gained elegance and reduction of redundancy.

I totally agree that we should review the usage of Sacred before we put too much time into refactoring the scripts. Why do you believe that Sacred is abandoned? When I look at their releases page it does not seem too bad. Alternatives to consider are Guild AI and MLflow. After a quick glance Guild AI looks more promising, but we should investigate this more in depth.

AdamGleave commented 1 year ago

> E.g. right now we use train_imitation to train BC as well as DAgger. If I am only interested in training BC, then I don't want to be bothered with the configuration values for DAgger in the print_config output.

That's a good point, I think I'm convinced it's worth splitting up scripts, so long as we can avoid substantial code duplication by splitting things into ingredients.

> Throwing AIRL and GAIL together eventually resulted in some weird design choices (see the comments in the table above). It is tempting to fuse them since they are so similar, but I feel the decision to generalize here (a good idea on its own) required so many hacks that they outweighed the gained elegance and reduction of redundancy.

I agree there are a lot of hacks. I still feel a bit uncertain about this; it's probably easier to discuss with a concrete idea of how to split them up and what can be shared via ingredients etc. It would be nice to have an easy way to share configs between AIRL and GAIL, as the right hyperparameters are often quite similar, although this use case is less common now that we're using automatically tuned hyperparameters (the configs saved in benchmarks/).

> I totally agree that we should review the usage of Sacred before we put too much time into refactoring the scripts. Why do you believe that Sacred is abandoned? When I look at their releases page it does not seem too bad. Alternatives to consider are Guild AI and MLflow. After a quick glance Guild AI looks more promising, but we should investigate this more in depth.

Abandoned was too strong a word. The original maintainers of Sacred stepped back from the project and did not hand it over cleanly. Currently @thequilo has been doing a heroic job reviewing and fixing outstanding issues, and things are improving now that he has access to make PyPI releases. However, the situation is a bit precarious: if @thequilo leaves there is no one else, and he still doesn't have full permissions on the project. https://github.com/IDSIA/sacred/issues/871#issuecomment-1164096520 and https://github.com/IDSIA/sacred/issues/879 have some relevant discussion.

Thanks for looking into alternatives. From a quick skim of Guild AI's docs I agree it seems promising, though there are some design decisions I feel hesitant about (e.g. treating global constants as flags by default...).

thequilo commented 1 year ago

@AdamGleave is not wrong, there is a risk of sacred becoming unmaintained when I step out. As I am currently a Ph.D. student, and it's completely unclear what will happen after I finish my Ph.D., this might happen in the not-so-far future. But since my lab heavily uses sacred for all experiment tracking, I'm sure we'll find someone from my lab to continue maintaining sacred, given that Qwlouse gives us access.

AdamGleave commented 1 year ago

Thanks for clarifying @thequilo -- good to know someone else in your lab is likely to pick it up. And good luck wrapping up your PhD!

ernestum commented 1 year ago

I had a deeper look at Guild AI, MLflow and neptune.ai. They are all mostly geared towards logging, analysis and reproducibility. In comparison to sacred, they lack a (hierarchical) configuration management system with experiments and (reusable) ingredients.

I think we have two options here (I would be interested in your opinion, @AdamGleave; I lean towards option 1):

1. Stick with sacred, and hope that it stays maintained. Maybe put some of our time into maintaining it if necessary, maybe push the rl-baselines-zoo to also use sacred.


2. Use one of the other frameworks (probably Guild AI) and use their configuration system (or something else like argparse/click).


AdamGleave commented 1 year ago

I'll try to look at this next week. @qxcv, any thoughts on the above? I think you've used Sacred a fair bit?

qxcv commented 1 year ago

I don't have a strong opinion on MLFlow/Guild AI/neptune.ai. I have used and like click (for simple stuff), especially over argparse. As for Sacred, I only used it for the EIRLI project. I've since moved away from using Sacred because (1) I only needed config parsing, and not the more sophisticated features (like tracking git revisions, logging to S3/wandb/whatever, etc.), (2) I couldn't get the configs-as-functions stuff to play well with type checking, and (3) I found argument parsing/option override behaviour hard to reason about.

My current go-to is Hydra with structured configs. Structured configs mean that your config is just a dataclass. This was an improvement for me, but it still has shortcomings, like poor support for union types and a slightly awkward API for turning "magic" features on and off. It also may not do some of the things you want if you're already using all the features in Sacred. It has 7k GitHub stars and is used/maintained by Facebook, though, so I expect that it will get better over time. It also has a surprising number of plugins/integrations (submitit/Slurm, Ray, etc.).
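For anyone unfamiliar, a minimal structured-config script looks roughly like this (assuming a recent Hydra version; the script and field names are made up, not imitation's actual config):

```python
from dataclasses import dataclass

import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import OmegaConf


# Hypothetical config for a single training script.
@dataclass
class TrainBCConfig:
    env_name: str = "CartPole-v1"
    batch_size: int = 32
    n_epochs: int = 10


cs = ConfigStore.instance()
cs.store(name="train_bc", node=TrainBCConfig)


@hydra.main(config_name="train_bc", version_base=None)
def main(cfg: TrainBCConfig) -> None:
    # Command-line overrides like `batch_size=64` are validated against the dataclass.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```

Overrides then look like `python train_bc.py batch_size=64 env_name=HalfCheetah-v4`.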

ernestum commented 1 year ago

I think we mostly use sacred for the configuration management and then W&B for logging? @AdamGleave, you probably know better what other features of sacred are used, but in the code it is mostly about configuration.

I will have a look at Hydra, thanks for the hint! What made you move from click to Hydra?

qxcv commented 1 year ago

> What made you move from click to Hydra?

Mostly the fact that I could declare my config as a (nested) dataclass & pass around the appropriate dataclass for each part of the code. Doing that with Click would require a lot of code duplication (repeating arguments as both Click options and dataclass members) and some extra parsing magic (to handle the nested options).
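Roughly, that pattern looks like this (a sketch with made-up field and function names; each component only receives its own sub-config):

```python
from dataclasses import dataclass, field
from typing import List

import hydra
from hydra.core.config_store import ConfigStore


@dataclass
class PolicyConfig:
    hidden_sizes: List[int] = field(default_factory=lambda: [64, 64])


@dataclass
class BCConfig:
    batch_size: int = 32
    policy: PolicyConfig = field(default_factory=PolicyConfig)


@dataclass
class ExperimentConfig:
    seed: int = 0
    bc: BCConfig = field(default_factory=BCConfig)


def build_policy(cfg: PolicyConfig):
    # This part of the code only ever sees the policy sub-config.
    print(f"building policy with layers {cfg.hidden_sizes}")


def train_bc(cfg: BCConfig):
    build_policy(cfg.policy)
    print(f"training BC with batch_size={cfg.batch_size}")


cs = ConfigStore.instance()
cs.store(name="experiment", node=ExperimentConfig)


@hydra.main(config_name="experiment", version_base=None)
def main(cfg: ExperimentConfig) -> None:
    train_bc(cfg.bc)  # pass only the relevant sub-config around


if __name__ == "__main__":
    main()
```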

The Sacred-like experiment management features in Hydra (automatically creating experiment directories, automatically setting up the logging module) are sometimes mildly convenient, but honestly something like this library would have solved my problems just as well.

ernestum commented 1 year ago

I think the idea of putting configs in dataclasses is interesting. I could implement a draft port of one of the scripts to Hydra to surface any unforeseen difficulties. What do you think @AdamGleave?

Edit: looks like Hydra would let us get rid of the parallel.py script!

AdamGleave commented 1 year ago

I'm in favor of trying to port one script to Hydra to try it out.

Eliminating parallel.py would be nice, I wrote that as a temporary hacky solution and somehow it's stuck around.

One feature Hydra seems to be missing is logging the Git commit hash and other things needed for reproducibility. That said, it seems like one could add that via callbacks, and WandB already does that, so I don't think we need our config framework to do it.
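For reference, such a callback might look roughly like the sketch below (this assumes Hydra's experimental callback API and that Git is available on the path; the class and registration names are made up):

```python
import subprocess
from typing import Any

from hydra.experimental.callback import Callback
from omegaconf import DictConfig


class GitInfoCallback(Callback):
    """Hypothetical callback that records the current Git commit at run start."""

    def on_run_start(self, config: DictConfig, **kwargs: Any) -> None:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
        print(f"git commit: {commit}")
```

It would then be enabled from the config by pointing a `_target_` entry under `hydra.callbacks` at the class (path hypothetical, e.g. `imitation_hydra.callbacks.GitInfoCallback`).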

ernestum commented 1 year ago

Feel free to have a look at #703 to see how it turned out.

qxcv commented 1 year ago

Wow, thanks for all that work converting things to Hydra! It's using a lot of patterns that I didn't even know existed, like instantiating objects with _target_. I haven't reviewed #703 in detail, but the PR seems very cool from just a skim.