Closed: shwang closed this issue 3 years ago.
This error has been around since the pytorch port.
Wow, nice catch. Are there any simple tests we can add that would catch issues like this?
Should mention that @qxcv found this bug!
> Should mention that @qxcv found this bug!
I also created the bug—a double achievement! :wink:
can you link the issue? Is that the image processing issue?
This is not a bug in SB, I just made the mistake of applying the SB helper functions twice in AIRL (at least along some code paths).
where exactly is the bug? working through airl now and can't seem to find double processing of inputs
@andipeng:
If I recall correctly:
In AIRL, observations, actions, and next_obs are processed once here in AdversarialTrainer:
https://github.com/HumanCompatibleAI/imitation/blob/cbcbbc78f7b191031bba30b4848e97b03377b07c/src/imitation/algorithms/adversarial.py#L321-L327
And then a second time here in RewardNet._eval_reward, while evaluating the generator-policy-training reward:
https://github.com/HumanCompatibleAI/imitation/blob/cbcbbc78f7b191031bba30b4848e97b03377b07c/src/imitation/rewards/reward_nets.py#L164-L172
I'm 75% confident that this is what @qxcv was talking about to me in private conversation a few weeks ago?
Interestingly, there's a duplicate function DiscrimNet._eval_reward, even though we manually determined that GAIL doesn't experience the same preprocessing duplication bug.
After digging into this more, I don't think anything actually gets preprocessed twice. _eval_reward in RewardNet never gets called when training AIRL or GAIL (verified by adding a pdb.set_trace() there, which never hit); it only runs when a RewardNet is deserialized.
_eval_reward in DiscrimNet does get called, but AFAICT only by RewardVecEnvWrapper, in which case it seems like it should preprocess, since those observations come directly from the environment and will not have gone through _torchify_with_space.
So it seems all is OK, although there is a lot of code duplication here which should probably be cleaned up.
@qxcv @shwang does that seem right to you? Do we do duplicate preprocessing anywhere else?
I can't immediately recall where the issue was. I'm going to write a test for it, though, since I struggled a lot with normalisation issues on the IL representations project. One easy way to check that things are working correctly is to add an assertion to your convnet which checks that all the inputs have intensities between 0 and 1, but also have a reasonable range (i.e. a maximum greater than 1/255; a maximum of exactly 1/255 is what would happen if the input was double-normalised). I'll probably do that with a simple environment that yields observations with a constant (uint8) pixel intensity of 255, or maybe one that alternates between 0 and 255.
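A minimal sketch of that assertion (plain numpy, independent of the imitation codebase; the batch shape and the constant-255 observations are just illustrative, and the 2/255 cutoff is slightly more conservative than exactly 1/255 to avoid float rounding at the boundary):

```python
import numpy as np

def check_normalized_once(batch: np.ndarray) -> None:
    """Assert a float image batch was divided by 255 exactly once.

    Assumes raw observations are uint8 in [0, 255]. After a single
    normalization pass the maximum is near 1; after a double pass it
    collapses to about 1/255.
    """
    assert batch.min() >= 0.0 and batch.max() <= 1.0, "not in [0, 1]"
    # Use 2/255 rather than exactly 1/255 to dodge float rounding.
    assert batch.max() > 2.0 / 255.0, "inputs look double-normalized"

# Hypothetical env that always emits constant uint8 pixels of 255.
raw = np.full((4, 84, 84, 3), 255, dtype=np.uint8)
once = raw.astype(np.float32) / 255.0   # correct: max == 1.0
twice = once / 255.0                    # the bug: max == 1/255

check_normalized_once(once)             # passes silently
try:
    check_normalized_once(twice)
    caught = False
except AssertionError:
    caught = True                       # double normalization detected
```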
Tests sound good, the code is convoluted enough it's hard to be confident just reading it. Though I'll also think if there's a way to refactor this to make it easier to understand.
Hi all -- I found the following issues with normalization in both algorithms:
1) The samples are normalized using the normalize_obs function from VecNormalize, whereas the rest of the code, including the discriminator and training env, is normalized using the preprocess_obs function from preprocessing.
2) next_obs on line 389 is never normalized.
I tried fixing normalization to be constant, and it seems to be better for GAIL (I'm not sure if I'm still missing something, because it's very slow to converge), but AIRL is still broken, so I'm not sure what else is going on there.
@andipeng it looks like we had a similar realization. Does https://github.com/HumanCompatibleAI/imitation/pull/256 cover the issues you found? I think it fixes 2) but I'm not sure about 1)
One other thing I suspect is an issue: the scale and mean of the learned reward jumps around a lot during training. VecNormalize only adjusts the scale (it does not center) and takes a running standard deviation, which is OK if the reward has a constant standard deviation but not when it's non-stationary.
I thought taking an EWMA would help but... it didn't seem to. But I only tested it on Cartpole, and didn't do much hyperparameter tuning. So it might be worth looking into further, here's my code: https://github.com/HumanCompatibleAI/imitation/commit/02d0854251d69f39f428918be450ed4276865927
I think a better solution might actually be to make the reward network use batchnorm and fix the output mean to zero and standard deviation to one. I don't have time to try this now so others should feel free to look into it. Otherwise I'll try to follow up on it later.
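To illustrate the difference in a toy numpy example (not the imitation/SB3 code; the running-std value and the drifted reward stream are made up): scale-only normalization leaves a drifted mean intact, while batch standardization recenters it.

```python
import numpy as np

def scale_only(r, running_std, eps=1e-8):
    """VecNormalize-style reward normalization: rescale by a running
    std estimate, but never center."""
    return r / (running_std + eps)

def standardize_batch(r, eps=1e-8):
    """Batch-norm-style: fix each batch to mean ~0 and std ~1, so the
    learned reward keeps a constant scale even when non-stationary."""
    return (r - r.mean()) / (r.std() + eps)

rng = np.random.default_rng(0)
early = rng.normal(loc=0.0, scale=1.0, size=1000)   # reward early on
late = rng.normal(loc=10.0, scale=1.0, size=1000)   # mean has drifted

# Scale-only is fine while the reward is stationary around zero...
assert abs(scale_only(early, running_std=1.0).mean()) < 0.2
# ...but leaves the drifted mean intact later in training.
assert scale_only(late, running_std=1.0).mean() > 5.0
# Per-batch standardization recenters regardless of the drift.
z = standardize_batch(late)
assert abs(z.mean()) < 1e-6 and abs(z.std() - 1.0) < 1e-6
```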
@AdamGleave Unfortunately not :/ I just tried running gail/airl again on my env and while it's certainly learning better than without the next_obs fix, I suspect the lack of standardization between VecNormalize on the samples vs. preprocessing_obs on the env is still causing issues.
If curious, I am running both on https://github.com/andipeng/simple-gridworld/blob/master/gym_minigrid/envs/simpleGrid.py, an extension in Minigrid (id=MiniGrid-SimpleGrid-Random-5x5-v0). Max reward should be ~6, whereas it's only hitting ~1.5 and then dying currently.
> I just tried running gail/airl again on my env and while it's certainly learning better than without the next_obs fix, I suspect the lack of standardization between VecNormalize on the samples vs. preprocessing_obs on the env is still causing issues.
In that PR I moved VecNormalize to before the RewardVecEnvWrapper. So I believe the reward network should be seeing normalized observations both during training and inference; the preprocessing happens after normalization in both code paths. That said, I believe the preprocessing is actually just an identity for spaces.Box unless the observations look image-shaped.
That said, the code is quite convoluted so I may well be missing something here. Could you give examples of the error? Or share an alternative fix we can compare?
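A simplified stand-in for that identity-unless-image behavior (not SB3's actual preprocess_obs; the real image-space check is stricter and operates on the space, not the array):

```python
import numpy as np

def preprocess_sketch(obs: np.ndarray) -> np.ndarray:
    """Sketch of preprocessing for Box observations: scale uint8
    image-looking inputs to [0, 1]; pass everything else through."""
    looks_like_image = obs.dtype == np.uint8 and obs.ndim >= 3
    if looks_like_image:
        return obs.astype(np.float32) / 255.0
    return obs  # identity for an ordinary float Box

vec_obs = np.array([[0.5, -1.2, 3.0]], dtype=np.float32)  # non-image Box
img_obs = np.full((1, 64, 64, 3), 255, dtype=np.uint8)    # image-shaped

assert preprocess_sketch(vec_obs) is vec_obs    # untouched
assert preprocess_sketch(img_obs).max() == 1.0  # scaled once
```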
> Max reward should be ~6, whereas it's only hitting ~1.5 and then dying currently.
What do you mean by dying? Does the reward get worse? Plateau? The script crashes?
The performance of this implementation of AIRL hasn't been great in my experiments either. But there may well be something wrong besides the normalization.
My fix is not great, but I remove normalization entirely and just train on the raw samples + env. My GAIL curve looks like this when I do that: [training-curve screenshot not included]
Thanks for sharing. I'll take a closer look at my code then to see if there's something wrong. Is there any particular line you think is suspect?
There is a confounder here if your fix just removes normalization. Normalization does not always improve performance, so it could be that this environment just does better without normalization (at least for our default hyperparameters).
If we're confident about your VecNormalize playing well with the normalize_obs function, then I don't have an idea as to a specific issue line (because that's what I originally thought was causing the error).
My gut still tells me that there's something fishy about a mismatch in normalization for two reasons: 1) I originally tried a hack where I implemented a division by a constant (to force my specific obs to be between 0 and 1) on both the reward network and the RL network in stable-baselines3, and that worked fine albeit is clearly unscalable; 2) normalization with PPO works perfectly as well.
I don't know what's causing gail to work but not airl though, so maybe the airl issue is entirely separate from normalization. Thanks for looking into this Adam.
> If we're confident about your VecNormalize playing well with the normalize_obs function, then I don't have an idea as to a specific issue line (because that's what I originally thought was causing the error).
VecNormalize just calls normalize_obs internally on the wrapped observations (as well as updating the internal statistics), so they should return the same values. That said, someone else should definitely take a look at this code, since it's clearly a hotspot for problems.
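A hand-rolled toy of that contract (not the real VecNormalize, which uses Welford-style per-element statistics, clipping, and an update toggle):

```python
import numpy as np

class RunningObsNormalizer:
    """Toy VecNormalize-for-observations: keep running mean/variance and
    apply the same transform on the step() path and via normalize_obs()."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.var, self.eps = 0, 0.0, 1.0, eps

    def update(self, obs):
        batch = np.asarray(obs, dtype=np.float64)
        new_count = self.count + batch.size
        self.mean += (batch.mean() - self.mean) * batch.size / new_count
        # crude variance blend -- enough for illustration
        self.var = (self.var * self.count + batch.var() * batch.size) / new_count
        self.count = new_count

    def normalize_obs(self, obs):
        return (np.asarray(obs) - self.mean) / np.sqrt(self.var + self.eps)

    def step(self, obs):
        # what the wrapped env returns: update stats, then normalize
        self.update(obs)
        return self.normalize_obs(obs)

norm = RunningObsNormalizer()
raw = np.array([1.0, 2.0, 3.0])
stepped = norm.step(raw)
# The wrapper's step() output and a direct normalize_obs() call agree:
assert np.allclose(stepped, norm.normalize_obs(raw))
```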
> My gut still tells me that there's something fishy about a mismatch in normalization for two reasons: 1) I originally tried a hack where I implemented a division by a constant (to force my specific obs to be between 0 and 1) on both the reward network and the RL network in stable-baselines3, and that worked fine albeit is clearly unscalable; 2) normalization with PPO works perfectly as well.
We have two VecNormalize wrappers, one of which normalizes just the observations and the other just the rewards. Can you try:
1) Turning off the reward-normalization one (python -m imitation.scripts.train_adversarial with algorithm_kwargs.shared.normalize_reward=False plus your usual arguments)? This should replicate your manual rescaling.
2) algorithm_kwargs.shared.debug_use_ground_truth=True -- what happens then? It should be basically the same as running PPO at that point.
> I don't know what's causing gail to work but not airl though, so maybe the airl issue is entirely separate from normalization. Thanks for looking into this Adam.
It's hard to know, GAIL is generally more stable, but that can also sometimes mask bugs.
> VecNormalize just calls normalize_obs internally on the wrapped observations (as well as updating the internal statistics). So they should return the same values. That said, someone else should definitely take a look at this code, since it's clearly a hotspot for problems.
Doesn't VecNormalize (on line 155) calculate the running mean/sd, whereas _torchify_with_space calls preprocess_obs to normalize by the observation_space type? In cases where our training env is normalized with the first while our generator/expert samples are normalized by the latter (say, for Image or Discrete obs or action spaces), there would be a mismatch, no?
> We have two VecNormalize, one of which is normalizing just the observations and the other just the rewards. Can you try:
> - Turning off the reward normalization one (python -m imitation.scripts.train_adversarial with algorithm_kwargs.shared.normalize_reward=False plus your usual arguments)? This should replicate your manual rescaling.
normalize_reward does not appear to be a valid arg.
> - Also what happens with algorithm_kwargs.shared.debug_use_ground_truth=True -- it should be basically the same as running PPO at that point?
Yep - works great, same as ppo (achieves max reward + very quick to converge).
>> VecNormalize just calls normalize_obs internally on the wrapped observations (as well as updating the internal statistics). So they should return the same values. That said, someone else should definitely take a look at this code, since it's clearly a hotspot for problems.
>
> Doesn't VecNormalize (on line 155) calculate the running mean/sd, whereas _torchify_with_space calls preprocess_obs to normalize by the observation_space type? In cases where our training env is normalized with the first while our generator/expert samples are normalized by the latter (say, for Image or Discrete obs or action spaces), there would be a mismatch, no?
I believe in both cases the observations are first normalized using VecNormalize and then preprocess_obs is called.
For reward/discriminator training, in https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/algorithms/adversarial.py#L403 we call normalize_obs on the VecNormalize object, and then in https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/algorithms/adversarial.py#L419 we call _torchify_with_space, which calls preprocess_obs.
For reward inference, venv_norm_obs (https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/algorithms/adversarial.py#L158) occurs before venv_wrapped, so self.discrim.predict_reward_train sees already-normalized observations. This function then later calls preprocess_obs: https://github.com/HumanCompatibleAI/imitation/blob/343eb25c551909c92af50a6c6d0162d4b31e1cb1/src/imitation/rewards/discrim_nets.py#L161
I believe these are the only places we use observations. Please let me know if I've missed something.
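So both code paths apply the same composition: VecNormalize-style normalization first, then preprocessing. A sketch of that invariant (hypothetical helper names; identity preprocessing stands in for a non-image Box):

```python
import numpy as np

def normalize_obs(obs, mean, std):
    """Stand-in for VecNormalize.normalize_obs with fixed statistics."""
    return (obs - mean) / std

def preprocess_obs(obs):
    """Stand-in for preprocessing; identity for a non-image Box."""
    return obs

def training_path(obs, mean, std):
    # discriminator training: normalize_obs, then _torchify_with_space
    # (which calls preprocess_obs)
    return preprocess_obs(normalize_obs(obs, mean, std))

def inference_path(obs, mean, std):
    # reward inference: venv_norm_obs wraps first, then
    # predict_reward_train preprocesses
    return preprocess_obs(normalize_obs(obs, mean, std))

obs = np.array([1.0, 2.0, 3.0])
same = np.array_equal(training_path(obs, 0.5, 2.0),
                      inference_path(obs, 0.5, 2.0))
assert same  # both paths see identically transformed observations
```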
> normalize_reward does not appear to be a valid arg.
This works for me:
python -m imitation.scripts.train_adversarial with airl algorithm_kwargs.shared.normalize_reward=False
What command are you running? Is it on the latest commit to that PR?
> Yep - works great, same as ppo (achieves max reward + very quick to converge).
OK, good to know, so our RL training is not broken at least.
> This works for me:
> python -m imitation.scripts.train_adversarial with airl algorithm_kwargs.shared.normalize_reward=False
On the correct commit now -- normalize_reward=False leads to this for AIRL: [training-curve screenshot not included]
Since we seem to be in agreement the bug that this issue described doesn't exist I'm closing it. Please feel free to re-open if we detect any actual duplicate preprocessing, or open a new issue if we uncover anything else that could explain AIRL's current under-performance.
Original issue description: AIRL is broken because its inputs are accidentally preprocessed twice by SB helper functions.