Looks like the items here are either completed or outdated at this point, perhaps with the exception of the DiscrimNet refactor?
It might be good to break some of these out into their own issues if they are still worth pursuing.
I'm fine with closing this issue. I ticked off some of the ones which were solved or stale. The only item I still cared about was the reward model interface; I made a new issue for it at https://github.com/HumanCompatibleAI/imitation/issues/246.
I'm closing this. Most of the items do seem to be addressed. I think Sam's point about generally de-complexifying the code and avoiding excessive hierarchy is still relevant, though; I'd be happy to have an issue on that, or it's something we can just bear in mind when making subsequent changes.
Issue to collect possible changes to the AIRL API. Feel free to treat as a Wiki and edit the main post to add issues.
- [ ] `AIRLTrainer` should take a set of expert demonstrations, rather than expert policies. Requiring expert policies effectively precludes using real human data.
- [ ] Replace `BaseRLModel` with `BasePolicy` wherever possible, e.g. rollout functions should only need a policy, not a complete RL algorithm. Some imitation algorithms also only need policies. (`BaseRLModel` does some preprocessing in `predict`, so the distinction between policy and RL algorithm is not totally clear in Stable Baselines.) (There's also no `save`/`load` API for policies; in this respect it's much easier to work with `BaseRLModel`.) (Arguably we never needed `BaseRLModel` to begin with. Possibly I should create an issue upstream in SB calling for policies, algorithms, preprocessors, and rollout utilities to be decoupled so that we can re-use their policy abstractions.)
- [ ] Add a `flatten()` fn that goes from trajectories back to transitions (sketch below). (Adam: :+1:)
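  A minimal sketch of such a helper, assuming a simple `Trajectory` container (the container and its field names are illustrative, not the library's actual types):

  ```python
  from typing import NamedTuple, Sequence
  import numpy as np

  class Trajectory(NamedTuple):
      # Hypothetical container: obs has one more entry than acts.
      obs: np.ndarray   # shape (T + 1, *obs_shape)
      acts: np.ndarray  # shape (T, *act_shape)

  def flatten(trajectories: Sequence[Trajectory]):
      """Convert trajectories into flat (obs, acts, next_obs) transition arrays."""
      obs = np.concatenate([traj.obs[:-1] for traj in trajectories])
      acts = np.concatenate([traj.acts for traj in trajectories])
      next_obs = np.concatenate([traj.obs[1:] for traj in trajectories])
      return obs, acts, next_obs
  ```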
- [ ] `env` should be allowed to be either an `Env`, a `VecEnv`, or a `str`. It shouldn't be `imitation`'s responsibility to create an environment: how would it know what preprocessing the user wants, how many copies of the environment, how to parallelize (`DummyVecEnv` or `SubprocVecEnv`), etc.? It is convenient, but I think we can get most of that benefit just by providing some helper methods to easily create e.g. vector environments.
- [x] Adam: whose responsibility is it to create the session? There are two sensible options IMO. 1) `imitation` is responsible for creating a fresh session and graph. Everything it does is isolated to this. The user can access these if they want (through instance variables `sess` and `graph`) but may also have their own sessions. This is how Stable Baselines does it. 2) `imitation` uses the default session and graph. The user is responsible for setting them up. If they want to isolate it, they must create a different session for imitation. This is how I've been writing my own project. Right now we're doing an awkward hybrid of the two: we create `self._sess = tf.Session()` in `Trainer`, but we're not guaranteed to use it! It's only the default session if it's the first session created. And we're not creating fresh graphs (this is why we're having to do `tf.reset_default_graph()` all over the place in tests). A sketch of option 1 is below.
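  For concreteness, a minimal TF1-style sketch of option 1; the class and attribute names are illustrative:

  ```python
  import tensorflow as tf

  class Trainer:
      def __init__(self):
          # Own a fresh graph and session, isolated from anything the user
          # has created. Exposed as instance variables for users who want them.
          self.graph = tf.Graph()
          with self.graph.as_default():
              self.sess = tf.Session(graph=self.graph)
              self._obs_ph = tf.placeholder(tf.float32, (None, 4), name="obs")
              self._loss = tf.reduce_mean(tf.square(self._obs_ph))
              self.sess.run(tf.global_variables_initializer())

      def eval_loss(self, obs):
          # Ops live in self.graph, so they run in self.sess; the caller's
          # default graph and session are never touched.
          return self.sess.run(self._loss, feed_dict={self._obs_ph: obs})
  ```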
- [ ] `init_trainer` lives inside its own file in `imitation/util/trainer.py`. Really this should be in `imitation/trainer.py`, or algorithms could get their own modules: `algos.gail`, `algos.airl`, `algos.mce_irl`, etc.
- [ ] The `Trainer` class has a confusing name given its functionality; it should be called something like `AdversarialILTrainer`. Likewise, the `train.py` script only handles AIRL/GAIL; it should either be renamed to `train_adversarial`, or rewritten to offer a common interface to different kinds of imitation algorithms. (Adam: :+1:)
- [ ] Rename `DiscrimNet` (to `AdversarialNet` or something instead, because `AIRLDiscrimNet` contains a `RewardNet`).
- [ ] In `RewardVecEnv.__init__`, remove the `include_steps` parameter in favor of passing `steps` to every reward function. (link)
- [ ] Use `TransitionsTuple` for transitions.
- [ ] `n_episodes` and `n_timesteps` should be replaced by `min_*` in several places, because rollout functions guarantee a minimum number of {episodes, timesteps}, not an exact number (sketch below).
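  A small sketch of the loop semantics that motivate the `min_*` naming (`sample_episode` is a hypothetical callback returning one complete episode):

  ```python
  def generate_trajectories(sample_episode, min_episodes=0, min_timesteps=0):
      """Sample whole episodes until both minima are met.

      Because sampling only stops at episode boundaries, the totals can
      overshoot: the arguments are guaranteed minimums, not exact counts.
      """
      trajectories, n_timesteps = [], 0
      while len(trajectories) < min_episodes or n_timesteps < min_timesteps:
          traj = sample_episode()
          trajectories.append(traj)
          n_timesteps += len(traj.acts)
      return trajectories
  ```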
- [ ] Summaries should be computed in the same `sess.run()` call as the training update, not with a separate `_summarize()` call that requires re-evaluating the whole network and its losses again.
- [ ] `AdversarialTrainer.eval_disc_loss()` should be removed, and discriminator loss should instead be evaluated (and logged) during discriminator training updates.
- [ ] Generator trajectory sampling should happen once, either inside `AdversarialTrainer.train_gen` or performed by the caller of `AdversarialTrainer.train_gen`, then passed through all functions that need trajectories (see the sketch under this item). The current arrangement means that `train_adversarial.train()` samples 5x as many trajectories as it needs to:
  - In `train_gen()`, Stable Baselines samples its own trajectories as part of `.learn()`.
  - `_populate_gen_replay_buffer()` samples more trajectories, instead of re-using the ones from SB.
  - `_TrainVisualiser.add_data_ep_reward` is then called three times (once for each configuration of reward shaping being applied) to compute statistics, sampling a completely new set of trajectories each time.
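  A sketch of the sample-once arrangement; `train_gen` matches the issue, while `replay_buffer` and `log_ep_reward_stats` are stand-ins:

  ```python
  def training_epoch(trainer, replay_buffer, log_ep_reward_stats):
      # Sample generator rollouts exactly once per epoch...
      gen_trajectories = trainer.train_gen()  # hypothetically returns its rollouts
      # ...then reuse them everywhere trajectories are needed, instead of
      # resampling in the replay buffer and again when logging statistics.
      replay_buffer.extend(gen_trajectories)
      log_ep_reward_stats(gen_trajectories)
  ```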
- [ ] The `RewardNet` abstraction only makes sense for AIRL. Would be good to have a `RewardNet` that works for all algorithms (including, e.g., MCE IRL, GCL, etc.). Specifics (sketch below):
  - The abstract class has a `reward_output` abstract property and nothing else. There's a concrete class that takes in observation/action spaces, preprocessing kwargs, (optional) action/observation placeholders, and a function for constructing models, then builds the graph for computing `reward_output`.
  - Shaped reward nets are composed of multiple `RewardNet`s: one for main reward, and two for past/future potential functions (shared weights for both). See Adam's reward model code for an example of how this could work.
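  A minimal TF1-style sketch of that interface (names other than `RewardNet` and `reward_output` are illustrative):

  ```python
  import abc
  import tensorflow as tf

  class RewardNet(abc.ABC):
      @property
      @abc.abstractmethod
      def reward_output(self):
          """Tensor of shape (batch,): predicted reward per transition."""

  class BasicRewardNet(RewardNet):
      """Builds the reward graph from caller-supplied pieces."""

      def __init__(self, obs_space, act_space, build_fn, obs_ph=None, act_ph=None):
          # Placeholders are optional; create defaults from the spaces if absent.
          self.obs_ph = obs_ph if obs_ph is not None else tf.placeholder(
              tf.float32, (None,) + obs_space.shape, name="obs")
          self.act_ph = act_ph if act_ph is not None else tf.placeholder(
              tf.float32, (None,) + act_space.shape, name="act")
          # `build_fn` maps placeholders to a (batch,) reward tensor.
          self._reward_output = build_fn(self.obs_ph, self.act_ph)

      @property
      def reward_output(self):
          return self._reward_output
  ```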
- [ ] Remove trivial properties like `@property def thing(self): return self._thing`. These make the code harder to grep (how do I figure out where `self.thing` came from?) but don't seem to serve any other purpose.
- [ ] Graph construction is scattered across many small methods: `AdversarialTrainer._build_disc_train`, `AdversarialTrainer._build_summarize`, `DiscrimNet.build_disc_loss`, `DiscrimNetAIRL.build_disc_loss`, `BasicShapedRewardNet.__init__`, `RewardNetShaped.__init__`, `RewardNet.__init__`, `BasicShapedRewardNet.build_theta_network`, `BasicShapedRewardNet.build_phi_network`, `RewardNetShaped.build_summaries`, etc. etc. Abuse of inheritance makes it even harder to read: in `*RewardNet*` there are cases where `subclass.foo()` will call `superclass.foo()`, which will call `subclass.bar()` and return to `subclass.foo()`, after which `subclass.foo()` will use the result of `subclass.bar()` produced by the up-call to `superclass.foo()`. That means you have to read all the subclass and superclass code to understand what `subclass.foo()` actually does! In most DL algorithm implementations there would be little or no inheritance, and all of the code for constructing a graph would be in one or two methods (here is an unpolished, but functional, example). I'm willing to bet that the ~750 lines of code in `discrim_net.py` and `reward_net.py` could be shrunk down to 200-300 lines if AIRL and GAIL logic was separated out, and most uses of inheritance were replaced with flags given to constructors (flag-based sketch below).
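  For instance, a flag-based constructor could build the whole graph in one place. A TF1 sketch (`build_mlp` and all other names here are stand-ins, not a proposed final design):

  ```python
  import tensorflow as tf

  def build_mlp(x, scope):
      # Re-entering the same scope with AUTO_REUSE shares the weights.
      with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
          h = tf.layers.dense(x, 32, activation=tf.nn.relu, name="hidden")
          out = tf.layers.dense(h, 1, name="out")
      return tf.squeeze(out, axis=1)

  class ShapedRewardNet:
      def __init__(self, obs_ph, act_ph, next_obs_ph, use_shaping=True,
                   discount=0.99):
          # All graph construction happens here, controlled by flags, instead
          # of being spread across subclass hook methods.
          self.reward = build_mlp(tf.concat([obs_ph, act_ph], axis=1), "reward")
          if use_shaping:
              # One potential function, evaluated at s and s' with shared weights.
              old_pot = build_mlp(obs_ph, "potential")
              new_pot = build_mlp(next_obs_ph, "potential")
              self.shaped_reward = self.reward + discount * new_pot - old_pot
          else:
              self.shaped_reward = self.reward
  ```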
- [ ] Split AIRL and GAIL out of the shared `AdversarialTrainer` and `DiscrimNet`. That would also force us to come up with a sensible common interface for IL algorithms, instead of having all the scripts assume that they're working with an `AdversarialTrainer`. (A minimal interface sketch follows this list.)
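A minimal sketch of what such a common interface could look like (entirely illustrative; the issue doesn't pin one down):

```python
import abc

class ImitationTrainer(abc.ABC):
    """Hypothetical common interface for imitation learning algorithms."""

    @abc.abstractmethod
    def train(self, total_timesteps: int) -> None:
        """Run the algorithm for at least `total_timesteps` environment steps."""

    @abc.abstractmethod
    def policy(self):
        """Return the current imitation policy."""
```

Scripts could then target `ImitationTrainer` instead of assuming an `AdversarialTrainer`.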