Stable-Baselines-Team / stable-baselines3-contrib

Contrib package for Stable-Baselines3 - Experimental reinforcement learning (RL) code
https://sb3-contrib.readthedocs.io
MIT License

[Bug] Performance differences between normal and masked PPO #140

Closed tyler-ingebrand closed 1 year ago

tyler-ingebrand commented 1 year ago

Describe the bug
I am getting different convergence results for PPO and MaskablePPO on the same environment. The mask is entirely permissive, i.e. all actions are allowed for MaskablePPO. I am doing this to verify convergence properties before adding my real mask. I would expect the learning curves to be similar, unless I am misunderstanding something. I have tested multiple seeds with the same results.

See training graphs below: [Masked PPO training curve] [Normal PPO training curve]

My environment is a MiniGrid environment. The observation space is an image and the action space is Discrete(7). I will summarize my code below; providing my exact env would be impractical, and I do not think the env is the issue.

Code example

import gym
import numpy as np

# image wrappers from the (gym-)minigrid package
from gym_minigrid.wrappers import RGBImgObsWrapper, ImgObsWrapper
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker

env = gym.make(...)  # my custom minigrid
env = RGBImgObsWrapper(env, tile_size=tile_size)
env = ImgObsWrapper(env)
env = MiniWrapper(env)  # the above wrappers are minigrid-specific; MiniWrapper is my own

def mask_fn(env: gym.Env) -> np.ndarray:
    # fully permissive mask: all 7 discrete actions are always allowed
    return np.ones(7)

# For the normal PPO case, swap MaskablePPO for PPO and do not use ActionMasker;
# otherwise everything is the same.
env = ActionMasker(env, mask_fn)  # wrap to enable action masking
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1)
model.learn(total_timesteps=200_000)
# graphs, visualization, etc.
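
For reference, the unmasked baseline mentioned in the comment above looks roughly like this (a sketch, assuming the same environment construction minus the ActionMasker wrapper; the plain ActorCriticPolicy is the standard SB3 counterpart, not something specific to my setup):

# Sketch of the unmasked baseline: same env wrappers, no ActionMasker,
# plain PPO with the standard actor-critic policy.
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

model = PPO(ActorCriticPolicy, env, verbose=1)
model.learn(total_timesteps=200_000)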

System Info
Installed with pip in a venv.
sb3-contrib 1.7.0
stable-baselines3 1.7.0

Additional context
Is there any advice on how to debug this? Is this expected behavior due to a required change in the masked version of the algorithm? The hyperparameters are all the same and the allowed actions in every state are the same, so intuitively I would expect the same results.

My minigrid does have sparse reward, so maybe exploration is different between algorithms?

Thank you for your help

tyler-ingebrand commented 1 year ago

About the only difference I can see between the two implementations is:

:param use_sde: Whether to use generalized State Dependent Exploration (gSDE)
        instead of action noise exploration (default: False)

Given that this defaults to False, I would think it should have no effect.

tyler-ingebrand commented 1 year ago

Whoops, found the issue. I forgot to specify a CNN policy, so the default policy was flattening my image into a vector. My bad. It would be nice, however, if a warning were thrown when the observation is an image and the first layer is a flatten layer.
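
In case it helps anyone else, the fix is roughly the following (a sketch; "CnnPolicy" is the standard policy alias in SB3 / sb3-contrib, and the rest of the setup above is unchanged):

# Sketch of the fix: use the "CnnPolicy" alias so image observations go through
# a CNN feature extractor instead of being flattened into a vector.
model = MaskablePPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)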