Looks like the items here are either completed or outdated at this point, perhaps with the exception of the DiscrimNet refactor?
It might be good to break some of these out into their own issues if they are still worth pursuing.
I'm fine with closing this issue. I ticked off some of the ones which were solved or stale. The only item I still cared about was the reward model interface; I made a new issue for it at https://github.com/HumanCompatibleAI/imitation/issues/246.
I'm closing this. Most of the items do seem to be addressed. I think Sam's point about generally de-complexifying the code and avoiding excessive hierarchy is still relevant, though; I'd be happy to have an issue on that, or it's something we can just bear in mind when making subsequent changes.
Issue to collect possible changes to the AIRL API. Feel free to treat as a Wiki and edit the main post to add issues.
- [ ] `AIRLTrainer` should take a set of expert demonstrations, rather than expert policies. Requiring expert policies effectively precludes using real human data.
- [ ] Replace `BaseRLModel` with `BasePolicy` wherever possible, e.g. rollout functions should only need a policy, not a complete RL algorithm. Some imitation algorithms also only need policies. (`BaseRLModel` does some preprocessing in `predict`, so the distinction between policy and RL algorithm is not totally clear in Stable Baselines.) (There's also no `save`/`load` API for policies; in this respect it's much easier to work with `BaseRLModel`.) (Arguably we never needed `BaseRLModel` to begin with. Possibly I should create an issue upstream in SB calling for policies, algorithms, preprocessors, and rollout utilities to be decoupled so that we can re-use their policy abstractions.)
- [ ] Add a `flatten()` fn that goes from trajectories back to transitions (sketch below). (Adam: :+1:)
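  A minimal sketch of such a helper, assuming a simple `Trajectory` container (the container and its field names are illustrative, not the library's actual types):

  ```python
  from typing import NamedTuple, Sequence
  import numpy as np

  class Trajectory(NamedTuple):
      # Hypothetical container: obs has one more entry than acts.
      obs: np.ndarray   # shape (T + 1, *obs_shape)
      acts: np.ndarray  # shape (T, *act_shape)

  def flatten(trajectories: Sequence[Trajectory]):
      """Convert trajectories into flat (obs, acts, next_obs) transition arrays."""
      obs = np.concatenate([traj.obs[:-1] for traj in trajectories])
      acts = np.concatenate([traj.acts for traj in trajectories])
      next_obs = np.concatenate([traj.obs[1:] for traj in trajectories])
      return obs, acts, next_obs
  ```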
- [ ] `env` should be allowed to be either an `Env`, a `VecEnv`, or a `str`. It shouldn't be `imitation`'s responsibility to create an environment: how would it know what preprocessing the user wants, how many copies of the environment, how to parallelize (`DummyVecEnv` or `SubprocVecEnv`), etc.? It is convenient, but I think we can get most of that benefit just by providing some helper methods to easily create e.g. vector environments.
- [x] Adam: whose responsibility is it to create the session? There are two sensible options IMO. 1) `imitation` is responsible for creating a fresh session and graph. Everything it does is isolated to this. The user can access these if they want (through instance variables `sess` and `graph`) but may also have their own sessions. This is how Stable Baselines does it. 2) `imitation` uses the default session and graph. The user is responsible for setting them up. If they want to isolate it, they must create a different session for imitation. This is how I've been writing my own project. Right now we're doing an awkward hybrid of the two: we create `self._sess = tf.Session()` in `Trainer`, but we're not guaranteed to use it! It's only the default session if it's the first session created. And we're not creating fresh graphs (this is why we're having to do `tf.reset_default_graph()` all over the place in tests). A sketch of option 1 is below.
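  For concreteness, a minimal TF1-style sketch of option 1; the class and attribute names are illustrative:

  ```python
  import tensorflow as tf

  class Trainer:
      def __init__(self):
          # Own a fresh graph and session, isolated from anything the user
          # has created. Exposed as instance variables for users who want them.
          self.graph = tf.Graph()
          with self.graph.as_default():
              self.sess = tf.Session(graph=self.graph)
              self._obs_ph = tf.placeholder(tf.float32, (None, 4), name="obs")
              self._loss = tf.reduce_mean(tf.square(self._obs_ph))
              self.sess.run(tf.global_variables_initializer())

      def eval_loss(self, obs):
          # Ops live in self.graph, so they run in self.sess; the caller's
          # default graph and session are never touched.
          return self.sess.run(self._loss, feed_dict={self._obs_ph: obs})
  ```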
- [ ] `init_trainer` lives inside its own file in `imitation/util/trainer.py`. Really this should be in `imitation/trainer.py`, or algorithms could get their own modules: `algos.gail`, `algos.airl`, `algos.mce_irl`, etc.
- [ ] The `Trainer` class has a confusing name given its functionality; it should be called something like `AdversarialILTrainer`. Likewise, the `train.py` script only handles AIRL/GAIL; it should either be renamed to `train_adversarial`, or rewritten to offer a common interface to different kinds of imitation algorithms. (Adam: :+1:)
- [ ] Rename `DiscrimNet` (to `AdversarialNet` or something instead, because `AIRLDiscrimNet` contains a `RewardNet`).
- [ ] In `RewardVecEnv.__init__`, remove the `include_steps` parameter in favor of passing `steps` to every reward function. (link)
- [ ] Use `TransitionsTuple` for transitions.
- [ ] `n_episodes` and `n_timesteps` should be replaced by `min_*` in several places, because rollout functions guarantee a minimum number of {episodes, timesteps}, not an exact number (sketch below).
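  A small sketch of the loop semantics that motivate the `min_*` naming (`sample_episode` is a hypothetical callback returning one complete episode):

  ```python
  def generate_trajectories(sample_episode, min_episodes=0, min_timesteps=0):
      """Sample whole episodes until both minima are met.

      Because sampling only stops at episode boundaries, the totals can
      overshoot: the arguments are guaranteed minimums, not exact counts.
      """
      trajectories, n_timesteps = [], 0
      while len(trajectories) < min_episodes or n_timesteps < min_timesteps:
          traj = sample_episode()
          trajectories.append(traj)
          n_timesteps += len(traj.acts)
      return trajectories
  ```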
- [ ] Summaries should be computed in the same `sess.run()` call as the training update, not with a separate `_summarize()` call that requires re-evaluating the whole network and its losses again.
- [ ] `AdversarialTrainer.eval_disc_loss()` should be removed, and discriminator loss should instead be evaluated (and logged) during discriminator training updates.
- [ ] Generator trajectory sampling should happen once, either inside `AdversarialTrainer.train_gen` or performed by the caller of `AdversarialTrainer.train_gen`, then passed through all functions that need trajectories (see the sketch under this item). The current arrangement means that `train_adversarial.train()` samples 5x as many trajectories as it needs to:
  - In `train_gen()`, Stable Baselines samples its own trajectories as part of `.learn()`.
  - `_populate_gen_replay_buffer()` samples more trajectories, instead of re-using the ones from SB.
  - `_TrainVisualiser.add_data_ep_reward` is then called three times (once for each configuration of reward shaping being applied) to compute statistics, sampling a completely new set of trajectories each time.
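  A sketch of the sample-once arrangement; `train_gen` matches the issue, while `replay_buffer` and `log_ep_reward_stats` are stand-ins:

  ```python
  def training_epoch(trainer, replay_buffer, log_ep_reward_stats):
      # Sample generator rollouts exactly once per epoch...
      gen_trajectories = trainer.train_gen()  # hypothetically returns its rollouts
      # ...then reuse them everywhere trajectories are needed, instead of
      # resampling in the replay buffer and again when logging statistics.
      replay_buffer.extend(gen_trajectories)
      log_ep_reward_stats(gen_trajectories)
  ```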
- [ ] The `RewardNet` abstraction only makes sense for AIRL. Would be good to have a `RewardNet` that works for all algorithms (including, e.g., MCE IRL, GCL, etc.). Specifics (sketch below):
  - The abstract class has a `reward_output` abstract property and nothing else. There's a concrete class that takes in observation/action spaces, preprocessing kwargs, (optional) action/observation placeholders, and a function for constructing models, then builds the graph for computing `reward_output`.
  - Shaped reward nets are composed of multiple `RewardNet`s: one for main reward, and two for past/future potential functions (shared weights for both). See Adam's reward model code for an example of how this could work.
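  A minimal TF1-style sketch of that interface (names other than `RewardNet` and `reward_output` are illustrative):

  ```python
  import abc
  import tensorflow as tf

  class RewardNet(abc.ABC):
      @property
      @abc.abstractmethod
      def reward_output(self):
          """Tensor of shape (batch,): predicted reward per transition."""

  class BasicRewardNet(RewardNet):
      """Builds the reward graph from caller-supplied pieces."""

      def __init__(self, obs_space, act_space, build_fn, obs_ph=None, act_ph=None):
          # Placeholders are optional; create defaults from the spaces if absent.
          self.obs_ph = obs_ph if obs_ph is not None else tf.placeholder(
              tf.float32, (None,) + obs_space.shape, name="obs")
          self.act_ph = act_ph if act_ph is not None else tf.placeholder(
              tf.float32, (None,) + act_space.shape, name="act")
          # `build_fn` maps placeholders to a (batch,) reward tensor.
          self._reward_output = build_fn(self.obs_ph, self.act_ph)

      @property
      def reward_output(self):
          return self._reward_output
  ```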
- [ ] Remove trivial properties like `@property def thing(self): return self._thing`. These make the code harder to grep (how do I figure out where `self.thing` came from?) but don't seem to serve any other purpose.
- [ ] Graph construction is scattered across many small methods: `AdversarialTrainer._build_disc_train`, `AdversarialTrainer._build_summarize`, `DiscrimNet.build_disc_loss`, `DiscrimNetAIRL.build_disc_loss`, `BasicShapedRewardNet.__init__`, `RewardNetShaped.__init__`, `RewardNet.__init__`, `BasicShapedRewardNet.build_theta_network`, `BasicShapedRewardNet.build_phi_network`, `RewardNetShaped.build_summaries`, etc. etc. Abuse of inheritance makes it even harder to read: in `*RewardNet*` there are cases where `subclass.foo()` will call `superclass.foo()`, which will call `subclass.bar()` and return to `subclass.foo()`, after which `subclass.foo()` will use the result of `subclass.bar()` produced by the up-call to `superclass.foo()`. That means you have to read all the subclass and superclass code to understand what `subclass.foo()` actually does! In most DL algorithm implementations there would be little or no inheritance, and all of the code for constructing a graph would be in one or two methods (here is an unpolished, but functional, example). I'm willing to bet that the ~750 lines of code in `discrim_net.py` and `reward_net.py` could be shrunk down to 200-300 lines if AIRL and GAIL logic was separated out, and most uses of inheritance were replaced with flags given to constructors (flag-based sketch below).
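  For instance, a flag-based constructor could build the whole graph in one place. A TF1 sketch (`build_mlp` and all other names here are stand-ins, not a proposed final design):

  ```python
  import tensorflow as tf

  def build_mlp(x, scope):
      # Re-entering the same scope with AUTO_REUSE shares the weights.
      with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
          h = tf.layers.dense(x, 32, activation=tf.nn.relu, name="hidden")
          out = tf.layers.dense(h, 1, name="out")
      return tf.squeeze(out, axis=1)

  class ShapedRewardNet:
      def __init__(self, obs_ph, act_ph, next_obs_ph, use_shaping=True,
                   discount=0.99):
          # All graph construction happens here, controlled by flags, instead
          # of being spread across subclass hook methods.
          self.reward = build_mlp(tf.concat([obs_ph, act_ph], axis=1), "reward")
          if use_shaping:
              # One potential function, evaluated at s and s' with shared weights.
              old_pot = build_mlp(obs_ph, "potential")
              new_pot = build_mlp(next_obs_ph, "potential")
              self.shaped_reward = self.reward + discount * new_pot - old_pot
          else:
              self.shaped_reward = self.reward
  ```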
- [ ] Split AIRL and GAIL out of the shared `AdversarialTrainer` and `DiscrimNet`. That would also force us to come up with a sensible common interface for IL algorithms, instead of having all the scripts assume that they're working with an `AdversarialTrainer`. (A minimal interface sketch follows this list.)
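A minimal sketch of what such a common interface could look like (entirely illustrative; the issue doesn't pin one down):

```python
import abc

class ImitationTrainer(abc.ABC):
    """Hypothetical common interface for imitation learning algorithms."""

    @abc.abstractmethod
    def train(self, total_timesteps: int) -> None:
        """Run the algorithm for at least `total_timesteps` environment steps."""

    @abc.abstractmethod
    def policy(self):
        """Return the current imitation policy."""
```

Scripts could then target `ImitationTrainer` instead of assuming an `AdversarialTrainer`.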