HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

AIRL: double inputs preprocessing error #229

Closed: shwang closed this issue 3 years ago

shwang commented 3 years ago

AIRL is broken because its inputs are accidentally preprocessed twice by SB helper functions.

shwang commented 3 years ago

This error has been around since the PyTorch port.

AdamGleave commented 3 years ago

Wow, nice catch. Are there any simple tests we can add that would catch issues like this?

shwang commented 3 years ago

Should mention that @qxcv found this bug!

qxcv commented 3 years ago

> Should mention that @qxcv found this bug!

I also created the bug—a double achievement! :wink:

araffin commented 3 years ago

Can you link the issue? Is that the image processing issue?

qxcv commented 3 years ago

This is not a bug in SB, I just made the mistake of applying the SB helper functions twice in AIRL (at least along some code paths).

andipeng commented 3 years ago

Where exactly is the bug? I'm working through AIRL now and can't seem to find any double preprocessing of inputs.

shwang commented 3 years ago

@andipeng:

If I recall correctly: AIRL observations, actions, and next_obs are processed once here in AdversarialTrainer: https://github.com/HumanCompatibleAI/imitation/blob/cbcbbc78f7b191031bba30b4848e97b03377b07c/src/imitation/algorithms/adversarial.py#L321-L327

And then a second time here in RewardNet._eval_reward while evaluating the generator-policy-training reward: https://github.com/HumanCompatibleAI/imitation/blob/cbcbbc78f7b191031bba30b4848e97b03377b07c/src/imitation/rewards/reward_nets.py#L164-L172
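To make the failure mode concrete, here's a toy illustration (not repo code) of what double-applying SB3-style image preprocessing would do to a uint8 observation:

```python
import numpy as np

# Toy illustration (not repo code): SB3-style image preprocessing rescales
# uint8 pixels into [0, 1]. Applying it twice squashes the values by another
# factor of 255, which is the suspected failure mode.
obs = np.full((84, 84, 1), 255, dtype=np.uint8)

once = obs.astype(np.float32) / 255.0   # max value 1.0, as intended
twice = once / 255.0                    # max value ~0.0039 -- far too small

print(once.max(), twice.max())
```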

I'm 75% confident that this is what @qxcv was talking to me about in a private conversation a few weeks ago.

Interestingly, there's a duplicate function DiscrimNet._eval_reward, even though we manually determined that GAIL doesn't experience the same preprocessing duplication bug.

AdamGleave commented 3 years ago

After digging into this more, I don't think anything actually gets preprocessed twice.

_eval_reward in RewardNet never gets called when training AIRL or GAIL (verified by adding a pdb.set_trace() there, which was never hit), only when a RewardNet is deserialized.

_eval_reward in DiscrimNet does get called, but AFAICT only by RewardVecEnvWrapper, in which case it seems like it should preprocess, since those observations come directly from the environment and will not have gone through _torchify_with_space.

So it seems all is OK, although there is a lot of code duplication here which should probably be cleaned up.

@qxcv @shwang does that seem right to you? Do we do duplicate preprocessing anywhere else?

qxcv commented 3 years ago

I can't immediately recall where the issue was. I'm going to write a test for it, though, since I struggled a lot with normalisation issues on the IL representations project. One easy way to check that things are working correctly is to add an assertion to the convnet which checks that all the inputs have intensities between 0 and 1, but also have a reasonable range (i.e. greater than 1/255, which is what you would get if the input was double-normalised). I'll probably do that with a simple environment that yields observations with a constant (uint8) pixel intensity of 255, or maybe one that alternates between 0 and 255.
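Roughly what I have in mind (just a sketch; ConstantImageEnv and assert_sane_image_input are made-up names, and the details may well change):

```python
import gym
import numpy as np
import torch as th


class ConstantImageEnv(gym.Env):
    """Made-up test env whose observations are all-255 uint8 images."""

    observation_space = gym.spaces.Box(low=0, high=255, shape=(32, 32, 1), dtype=np.uint8)
    action_space = gym.spaces.Discrete(2)

    def reset(self):
        return np.full(self.observation_space.shape, 255, dtype=np.uint8)

    def step(self, action):
        return self.reset(), 0.0, False, {}


def assert_sane_image_input(x: th.Tensor) -> None:
    """Assertion to drop into the convnet's forward(): after a single
    preprocessing pass the inputs should lie in [0, 1], but not be squashed
    to at most 1/255, which is what double normalisation would look like."""
    assert x.min() >= 0.0 and x.max() <= 1.0, "input not normalised into [0, 1]"
    assert x.max() > 1.0 / 255.0, "input looks double-normalised (max <= 1/255)"
```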

AdamGleave commented 3 years ago

Tests sound good; the code is convoluted enough that it's hard to be confident just reading it. Though I'll also think about whether there's a way to refactor this to make it easier to understand.

andipeng commented 3 years ago

Hi all -- I found the following issue with normalization in both algorithms:

  1. On lines 383-384 in adversarial.py, the generator/expert inputs are normalized using the normalize_obs function from VecNormalize, whereas the rest of the code, including the discriminator and the training env, is normalized using the preprocess_obs function from SB3's preprocessing module.
  2. Moreover, only the current expert/generator samples are normalized when the batch is created, whereas next_obs on line 389 is never normalized.

I tried fixing the normalization to be constant, and it seems to help GAIL (though I'm not sure if I'm still missing something, because it's very slow to converge), but AIRL is still broken, so I'm not sure what else is going on there.

AdamGleave commented 3 years ago

@andipeng it looks like we had a similar realization. Does https://github.com/HumanCompatibleAI/imitation/pull/256 cover the issues you found? I think it fixes 2) but I'm not sure about 1)

One other thing I suspect is an issue: the scale and mean of the learned reward jump around a lot during training. VecNormalize only adjusts the scale (it does not center it) and takes a running standard deviation, which is OK if the reward has a constant standard deviation but not when it's non-stationary.

I thought taking an EWMA would help but... it didn't seem to. But I only tested it on CartPole, and didn't do much hyperparameter tuning. So it might be worth looking into further; here's my code: https://github.com/HumanCompatibleAI/imitation/commit/02d0854251d69f39f428918be450ed4276865927

I think a better solution might actually be to make the reward network use batchnorm and fix the output mean to zero and standard deviation to one. I don't have time to try this now so others should feel free to look into it. Otherwise I'll try to follow up on it later.
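For concreteness, the kind of thing I mean is roughly this (untested sketch, not an actual implementation; assumes the base net returns a flat batch of scalar rewards):

```python
import torch as th
import torch.nn as nn


class StandardizedRewardHead(nn.Module):
    """Sketch: wrap a reward network so its outputs are kept at roughly zero
    mean and unit standard deviation via batch norm over the scalar reward,
    instead of relying on VecNormalize's running scale."""

    def __init__(self, base_reward_net: nn.Module):
        super().__init__()
        self.base = base_reward_net
        # affine=False: standardize only, don't learn a new scale/shift.
        self.norm = nn.BatchNorm1d(1, affine=False)

    def forward(self, *inputs) -> th.Tensor:
        raw = self.base(*inputs)  # assumed shape: (batch,)
        return self.norm(raw.view(-1, 1)).view(-1)
```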

andipeng commented 3 years ago

@AdamGleave Unfortunately not :/ I just tried running GAIL/AIRL again on my env and, while it's certainly learning better than without the next_obs fix, I suspect the lack of standardization between VecNormalize on the samples vs. preprocess_obs on the env is still causing issues.

If you're curious, I am running both on https://github.com/andipeng/simple-gridworld/blob/master/gym_minigrid/envs/simpleGrid.py, an extension of MiniGrid (id=MiniGrid-SimpleGrid-Random-5x5-v0). Max reward should be ~6, whereas it's currently only hitting ~1.5 and then dying.

AdamGleave commented 3 years ago

> @AdamGleave Unfortunately not :/ I just tried running GAIL/AIRL again on my env and, while it's certainly learning better than without the next_obs fix, I suspect the lack of standardization between VecNormalize on the samples vs. preprocess_obs on the env is still causing issues.

In that PR I moved VecNormalize to before the RewardVecEnvWrapper, so I believe the reward network should be seeing normalized observations both during training and inference. The preprocessing, I believe, happens after normalization in both code paths. That said, the preprocessing is actually just an identity for spaces.Box unless the observations look image-shaped.
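For example, this kind of quick check should confirm it (assuming I'm reading SB3's preprocessing correctly; not repo code):

```python
import numpy as np
import torch as th
from gym import spaces
from stable_baselines3.common.preprocessing import preprocess_obs

# Non-image Box: preprocess_obs just converts to float, no rescaling.
box = spaces.Box(low=-10.0, high=10.0, shape=(4,), dtype=np.float32)
x = th.as_tensor([[1.0, 2.0, 3.0, 4.0]])
print(preprocess_obs(x, box))        # values unchanged

# Image-shaped uint8 Box: pixels are rescaled into [0, 1].
img = spaces.Box(low=0, high=255, shape=(32, 32, 3), dtype=np.uint8)
y = th.full((1, 32, 32, 3), 255, dtype=th.uint8)
print(preprocess_obs(y, img).max())  # tensor(1.)
```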

That said, the code is quite convoluted so I may well be missing something here. Could you give examples of the error? Or share an alternative fix we can compare?

> Max reward should be ~6, whereas it's currently only hitting ~1.5 and then dying.

What do you mean by dying? Does the reward get worse? Plateau? Does the script crash?

The performance of this implementation of AIRL hasn't been great in my experiments either. But there may well be something wrong besides the normalization.

andipeng commented 3 years ago

Sure -- using your PR above, GAIL looks like this:

[Screenshot: GAIL reward curve with the PR fix, 2020-12-10]

My fix is not great, but I remove normalization entirely and just train on the raw samples + env. My GAIL looks like this when I do that:

[Screenshot: GAIL reward curve with normalization removed, 2020-12-10]

Strangely, AIRL still explodes in the end, even with normalization disabled completely:

[Screenshot: AIRL reward curve]

AdamGleave commented 3 years ago

> My fix is not great, but I remove normalization entirely and just train on the raw samples + env. My GAIL looks like this when I do that:

Thanks for sharing. I'll take a closer look at my code then to see if there's something wrong. Is there any particular line you think is suspect?

There is a confounder here if your fix just removes normalization. Normalization does not always improve performance, so it could be that this environment just does better without normalization (at least for our default hyperparameters).

andipeng commented 3 years ago

If we're confident that your VecNormalize change plays well with the normalize_obs function, then I don't have a specific line in mind (because that mismatch is what I originally thought was causing the error).

My gut still tells me that there's something fishy about a mismatch in normalization, for two reasons: 1) I originally tried a hack where I divided by a constant (to force my specific obs to be between 0 and 1) in both the reward network and the RL network in stable-baselines3, and that worked fine, although it clearly doesn't scale; 2) normalization with PPO works perfectly as well.
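(For reference, the hack in 1) was essentially just this kind of hard-coded rescaling applied to observations before both networks; OBS_SCALE is a placeholder constant specific to my env:)

```python
import numpy as np

OBS_SCALE = 255.0  # placeholder constant, chosen so my specific obs land in [0, 1]


def scale_obs(obs: np.ndarray) -> np.ndarray:
    """Hack: force observations into [0, 1] by dividing by a fixed constant."""
    return obs.astype(np.float32) / OBS_SCALE
```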

I don't know why GAIL works but AIRL doesn't, though, so maybe the AIRL issue is entirely separate from normalization. Thanks for looking into this, Adam.

AdamGleave commented 3 years ago

> If we're confident that your VecNormalize change plays well with the normalize_obs function, then I don't have a specific line in mind (because that mismatch is what I originally thought was causing the error).

VecNormalize just calls normalize_obs internally on the wrapped observations (as well as updating the internal statistics). So they should return the same values. That said, someone else should definitely take a look at this code, since it's clearly a hotspot for problems.
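For example, a check along these lines should hold (quick sketch, not repo code):

```python
import gym
import numpy as np
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Observations returned by a VecNormalize-wrapped env should match what
# normalize_obs gives for the raw observations, since the wrapper just calls
# normalize_obs internally after updating its running statistics.
venv = VecNormalize(DummyVecEnv([lambda: gym.make("CartPole-v1")]),
                    norm_obs=True, norm_reward=False)
obs = venv.reset()
raw_obs = venv.get_original_obs()
assert np.allclose(obs, venv.normalize_obs(raw_obs))
```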

> My gut still tells me that there's something fishy about a mismatch in normalization, for two reasons: 1) I originally tried a hack where I divided by a constant (to force my specific obs to be between 0 and 1) in both the reward network and the RL network in stable-baselines3, and that worked fine, although it clearly doesn't scale; 2) normalization with PPO works perfectly as well.

We have two VecNormalize wrappers, one normalizing just the observations and the other just the rewards. Can you try:

  1. Turning off the reward normalization one (python -m imitation.scripts.train_adversarial with algorithm_kwargs.shared.normalize_reward=False plus your usual arguments)? This should replicate your manual rescaling.
  2. Also, what happens with algorithm_kwargs.shared.debug_use_ground_truth=True? It should be basically the same as running PPO at that point.

> I don't know why GAIL works but AIRL doesn't, though, so maybe the AIRL issue is entirely separate from normalization. Thanks for looking into this, Adam.

It's hard to know; GAIL is generally more stable, but that can also sometimes mask bugs.

andipeng commented 3 years ago

> VecNormalize just calls normalize_obs internally on the wrapped observations (as well as updating the internal statistics). So they should return the same values. That said, someone else should definitely take a look at this code, since it's clearly a hotspot for problems.

Doesn't VecNormalize (on line 155) compute a running mean/std, whereas _torchify_with_space calls preprocess_obs to normalize based on the observation_space type? In cases where our training env is normalized with the former while our generator/expert samples are normalized with the latter (say, for image or Discrete observation or action spaces), there would be a mismatch, no?

> We have two VecNormalize wrappers, one normalizing just the observations and the other just the rewards. Can you try:

> 1. Turning off the reward normalization one (python -m imitation.scripts.train_adversarial with algorithm_kwargs.shared.normalize_reward=False plus your usual arguments)? This should replicate your manual rescaling.

normalize_reward does not appear to be a valid arg.

> 2. Also, what happens with algorithm_kwargs.shared.debug_use_ground_truth=True? It should be basically the same as running PPO at that point.

Yep - works great, same as PPO (achieves max reward and converges very quickly).

AdamGleave commented 3 years ago

> VecNormalize just calls normalize_obs internally on the wrapped observations (as well as updating the internal statistics). So they should return the same values. That said, someone else should definitely take a look at this code, since it's clearly a hotspot for problems.

> Doesn't VecNormalize (on line 155) compute a running mean/std, whereas _torchify_with_space calls preprocess_obs to normalize based on the observation_space type? In cases where our training env is normalized with the former while our generator/expert samples are normalized with the latter (say, for image or Discrete observation or action spaces), there would be a mismatch, no?

I believe in both cases the observations are first normalized using VecNormalize and then preprocess_obs is called.

For reward/discriminator training, in https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/algorithms/adversarial.py#L403 we call normalize_obs on the VecNormalize object and then in https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/algorithms/adversarial.py#L419 we call _torchify_with_space which calls preprocess_obs.

For reward inference, venv_norm_obs (https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/algorithms/adversarial.py#L158) is applied before venv_wrapped, so self.discrim.predict_reward_train sees already-normalized observations. That function then calls preprocess_obs: https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/rewards/discrim_nets.py#L161

I believe these are the only places we use observations. Please let me know if I've missed something.

> normalize_reward does not appear to be a valid arg.

This works for me:

python -m imitation.scripts.train_adversarial with airl algorithm_kwargs.shared.normalize_reward=False

What command are you running? Is it on the latest commit to that PR?

> Yep - works great, same as PPO (achieves max reward and converges very quickly).

OK, good to know, so our RL training is not broken at least.

andipeng commented 3 years ago

> This works for me:

> python -m imitation.scripts.train_adversarial with airl algorithm_kwargs.shared.normalize_reward=False

On the correct commit now -- normalize_reward=False leads to this for AIRL:

[Screenshot: AIRL reward curve with normalize_reward=False, 2020-12-11]

AdamGleave commented 3 years ago

Since we seem to be in agreement that the bug this issue described doesn't exist, I'm closing it. Please feel free to re-open it if we detect any actual duplicate preprocessing, or open a new issue if we uncover anything else that could explain AIRL's current under-performance.