Stable-Baselines-Team / stable-baselines3-contrib

Contrib package for Stable-Baselines3 - Experimental reinforcement learning (RL) code
https://sb3-contrib.readthedocs.io
MIT License

[Bug] Performance differences between normal and masked PPO #140

Closed tyler-ingebrand closed 1 year ago

tyler-ingebrand commented 1 year ago

Describe the bug
I am getting different convergence results for PPO and MaskablePPO on the same environment. The mask is entirely permissive, i.e. all actions are allowed for MaskablePPO. I am doing this to verify convergence properties before adding my real mask. I would expect the learning curves to be similar, unless I am misunderstanding something. I have tested multiple seeds with the same results.

See training graphs below: [Masked PPO training curve] [Normal PPO training curve]

My environment is a MiniGrid environment. The observation space is an image and the action space is Discrete(7). I will summarize my code below; providing my exact env would be impractical, and I do not think the env is the issue.

Code example

import gym
import numpy as np

# image wrappers from the (gym-)minigrid package
from gym_minigrid.wrappers import RGBImgObsWrapper, ImgObsWrapper
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker

env = gym.make(...)  # my custom minigrid
env = RGBImgObsWrapper(env, tile_size=tile_size)
env = ImgObsWrapper(env)
env = MiniWrapper(env)  # the above wrappers are minigrid-specific; MiniWrapper is my own

def mask_fn(env: gym.Env) -> np.ndarray:
    # fully permissive mask: all 7 discrete actions are always allowed
    return np.ones(7)

# For the normal PPO case, swap MaskablePPO for PPO and do not use ActionMasker;
# otherwise everything is the same.
env = ActionMasker(env, mask_fn)  # wrap to enable action masking
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1)
model.learn(total_timesteps=200_000)
# graphs, visualization, etc.
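
For reference, the unmasked baseline mentioned in the comment above looks roughly like this (a sketch, assuming the same environment construction minus the ActionMasker wrapper; the plain ActorCriticPolicy is the standard SB3 counterpart, not something specific to my setup):

# Sketch of the unmasked baseline: same env wrappers, no ActionMasker,
# plain PPO with the standard actor-critic policy.
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

model = PPO(ActorCriticPolicy, env, verbose=1)
model.learn(total_timesteps=200_000)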

System Info
Installed with pip in a venv.
sb3-contrib 1.7.0
stable-baselines3 1.7.0

Additional context
Is there any advice on how to debug this? Is this expected behavior due to a required change in the masked version of the algorithm? The hyperparameters are all the same and the allowed actions in every state are the same, so intuitively I would expect the same results.

My minigrid does have sparse reward, so maybe exploration is different between algorithms?

Thank you for your help

tyler-ingebrand commented 1 year ago

About the only difference I can see between the two implementations is:

:param use_sde: Whether to use generalized State Dependent Exploration (gSDE)
        instead of action noise exploration (default: False)

Given that this defaults to False, I would think it should have no effect.

tyler-ingebrand commented 1 year ago

Whoops, found the issue. I forgot to specify a CNN policy, so the default policy was flattening my image into a vector. My bad. It would be nice, however, if a warning were thrown when the observation is an image and the first layer is a flatten layer.
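
In case it helps anyone else, the fix is roughly the following (a sketch; "CnnPolicy" is the standard policy alias in SB3 / sb3-contrib, and the rest of the setup above is unchanged):

# Sketch of the fix: use the "CnnPolicy" alias so image observations go through
# a CNN feature extractor instead of being flattened into a vector.
model = MaskablePPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)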