[feature request] DQN Clipped and DQN Reg Algorithms

AurelianTactics commented 3 years ago

The paper 'Evolving RL Algorithms' (https://arxiv.org/abs/2101.03958) uses evolution strategies to find new modifications of DQN. The paper reports the two best found algorithms DQN Clipped and DQN Reg. I modified the main stable baselines 3 to DQN implementation to have those modifications. Is there any interest in me formally implementing DQN Clipped and DQN Reg through SB3 Contrib and doing a pull request? It's kind of an obscure paper and the modifications are kind of slight so I'm not sure if there's much interest in it.

The paper calls DQN Clipped and Reg new algorithms but they are basically slight modifications to the DQN loss function (example change for DQN Reg replacying the DQN's Huber Loss: https://github.com/AurelianTactics/dqnclipped_dqnreg_prelim_implementation/blob/main/utils.py#L382). My proposed implementation:

Create a new learning algorithm. Basically DQN with two new hyperparameters. One to decide on the loss type (standard, Clipped, Reg), and the other for the DQN Reg hyperparameter.
Test the results on the four classic control algorithms reported and plotted in the paper: CartPole, Acrobot, LunarLander, and MountainCar. Run 5 random seeds and use the EvalCallback to get the numpy results, plot those results against the paper's plots to compare. The paper also tested against MiniGrid and did some Atari tests for DQN Reg but I'd prefer if the four Classic Control results are enough.

Aside note: is there an example using EvalCallback for Atari for SB3 (main repo not contrib)? I was able to get EvalCallback working eventually for Atari but I had to modify some code in EvalCallback. The SB3 code would take my train env and run it through VecTransposeImage. However the eval env would be run through EvalCallback and not get the same change so I would get a warning and then an error. I fixed it by having EvalCallback do the same check to see if a VecTransposeImage was needed and then do the transpose (if needed) but I feel like there's probably a more standard, user friendly way.

araffin commented 3 years ago

Hello,

thanks for the proposal.

Create a new learning algorithm. Basically DQN with two new hyperparameters. One to decide on the loss type (standard, Clipped, Reg), and the other for the DQN Reg hyperparameter.

if I get it, the proposed algorithm just changes the loss function but nothing else? In that case, I would implement it as a child class of DQN and just override the train() method.

Test the results on the four classic control algorithms reported and plotted in the paper:

testing results on classic control is required but not enough (they are two simple to detect any problem). We would need to test it on at least two atari games (+ minigrid if it does not require dict obs, see https://github.com/DLR-RM/stable-baselines3/pull/243), I can help with that ;) And before doing the PR, make sure to read the contributing guide of SB3 contrib carefully ;)

Aside note: is there an example using EvalCallback for Atari for SB3 (main repo not contrib)?

there was a bug that was fixed recently, please upgrade to the latest SB3 version, I recommend to use master version ;) (see https://github.com/DLR-RM/stable-baselines3/pull/336 for bug fix and doc for installing master version)

Miffyli commented 3 years ago

To extend bit on arrafin's comment:

We highly recommend using zoo to run the experiments (it does all the environment wrapping, result gathering and plotting for you). It will also make it easier to replicate the results in future.
Looking at Table 1 of the paper it seems RoadRunner and Asteroid are good environments to test out the DQNReg. Results with these two environments should suffice for Atari experiments.

araffin commented 3 years ago

One last remark: do we really need the two? It looks like DQNReg is both simpler and working as good (or better) on all the tasks, no?

Miffyli commented 3 years ago

One last remark: do we really need the two?

~~If we have results from other environments then I think we are fine with just one Atari game.~~

Edit: Oh darn I confused things. We could start with DQNReg and see if people have energy to add the other one, in that case :).

araffin commented 3 years ago

Looking at Table 1 of the paper it seems RoadRunner and Asteroid are good environments to test out the DQNReg. Results with these two environments should suffice for Atari experiments.

I've launched jobs with DQN, QR-DQN, A2C and PPO on those ones (btw, the name is "Asteroids" with an "s" ;), I had to restart jobs because of that ^^"), and I will pushed the trained models here: https://github.com/DLR-RM/rl-baselines3-zoo/pull/69

AurelianTactics commented 3 years ago

Thanks for the feedback. I'll try to put something together.

if I get it, the proposed algorithm just changes the loss function but nothing else? In that case, I would implement it as a child class of DQN and just override the train() method.

Looks like the latest version of DQN separates the target_q_values and the next_q_values (https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/dqn/dqn.py#L160) so the only change that is needed is to change the loss function.

AurelianTactics commented 3 years ago

I'm still working on this. I had a few questions. I was able to get the implementation of DQNReg and DQNClipped to work in a contrib fork (https://github.com/AurelianTactics/stable-baselines3-contrib) on the four classic control envs (CartPole, Acrobot, LunarLander, MountainCar). I used zoo to run the experiments. I'm starting the Atari runs today. A few things came up.

I ran all the tests in the tests folder and am having tests fail because the observation_space is None and fails a check (see below). I think what's happening is I'm not getting the env's observation_space information to the model's self.observation_space parameter (how is that usually done?). Maybe I screwed up the inheritance (https://github.com/AurelianTactics/stable-baselines3-contrib/blob/master/sb3_contrib/dqnreg/dqnreg.py)? Maybe I'm running the tests wrong? Maybe there's some extra step needed since the algorithms I implemented use DQNPolicy and don't have their own specific policies.py file created for them?
I can't get the tensorboard logs to save. When I run SB3 locally I don't have an issue but I'm running this on an Amazon Web Services instance (basically this tutorial with a different image: https://aws.amazon.com/getting-started/hands-on/train-deep-learning-model-aws-ec2-containers/) and can't seem to get the logs to save anywhere. Is it fine if I have the other saved outputs from the zoo experiments but not the tensorboard logs?

I also ran all the tests in the tests folder and was able to pass most of them. I failed 4 tests (2 for each algorithm) all with the same general error message:

... tests/test_cnn.py::test_cnn[DQNReg] FAILED
tests/test_cnn.py::test_cnn[DQNClipped] FAILED ... tests/test_save_load.py::test_exclude_include_saved_params[DQNReg] FAILED tests/test_save_load.py::test_exclude_include_saved_params[DQNClipped] FAILED ...

============================================= FAILURES __ test_exclude_include_saved_params[DQNClipped] ___

tmp_path = PosixPath('/tmp/pytest-of-root/pytest-12/test_exclude_include_saved_par3') model_class = <class 'sb3_contrib.dqnclipped.dqnclipped.DQNClipped'>
@pytest.mark.parametrize("model_class", MODEL_LIST)
def test_exclude_include_saved_params(tmp_path, model_class):
    """
    Test if exclude and include parameters of save() work

    :param model_class: (BaseAlgorithm) A RL model
    """
    env = DummyVecEnv([lambda: select_env(model_class)])

    # create model, set verbose as 2, which is not standard
    model = model_class("MlpPolicy", env, policy_kwargs=dict(net_arch=[16]), verbose=2)

    # Check if exclude works
    model.save(tmp_path / "test_save", exclude=["verbose"])
    del model
  model = model_class.load(str(tmp_path / "test_save.zip"))
tests/test_save_load.py:215:

/opt/conda/lib/python3.6/site-packages/stable_baselines3/common/base_class.py:639: in load _init_setup_model=False, # pytype: disable=not-instantiable,wrong-keyword-args sb3_contrib/dqnclipped/dqnclipped.py:61: in init optimize_memory_usage=optimize_memory_usage, /opt/conda/lib/python3.6/site-packages/stable_baselines3/dqn/dqn.py:121: in init self._setup_model() /opt/conda/lib/python3.6/site-packages/stable_baselines3/dqn/dqn.py:124: in _setup_model super(DQN, self)._setup_model() /opt/conda/lib/python3.6/site-packages/stable_baselines3/common/off_policy_algorithm.py:175: in _setup_model optimize_memory_usage=self.optimize_memory_usage, /opt/conda/lib/python3.6/site-packages/stable_baselines3/common/buffers.py:171: in init super(ReplayBuffer, self).init(buffer_size, observation_space, action_space, device, n_envs=n_envs) /opt/conda/lib/python3.6/site-packages/stable_baselines3/common/buffers.py:44: in init self.obs_shape = get_obs_shape(observation_space)

observation_space = None
def get_obs_shape(observation_space: spaces.Space) -> Tuple[int, ...]:
    """
    Get the shape of the observation (useful for the buffers).

    :param observation_space:
    :return:
    """
    if isinstance(observation_space, spaces.Box):
        return observation_space.shape
    elif isinstance(observation_space, spaces.Discrete):
        # Observation is an int
        return (1,)
    elif isinstance(observation_space, spaces.MultiDiscrete):
        # Number of discrete features
        return (int(len(observation_space.nvec)),)
    elif isinstance(observation_space, spaces.MultiBinary):
        # Number of binary features
        return (int(observation_space.n),)
    else:
      raise NotImplementedError(f"{observation_space} observation space is not supported")
E NotImplementedError: None observation space is not supported
/opt/conda/lib/python3.6/site-packages/stable_baselines3/common/preprocessing.py:141: NotImplementedError

araffin commented 3 years ago

ran all the tests in the tests folder and am having tests fail because the observation_space is None and fails a check (see below). I think what's happening is I'm not getting the env's observation_space information to the model's self.observation_space parameter (how is that usually done?).

Could you please open a draft PR?

and can't seem to get the logs to save anywhere. Is it fine if I have the other saved outputs from the zoo experiments but not the tensorboard logs?

tensorboard logs are not needed. Training curves and final results are (using EvalCallback included in the zoo which will save a evaluations.npz file).

It looks like DQNReg is both simpler and working as good (or better) on all the tasks, no?

As mentioned before, we may probably include only one of the two.

Stable-Baselines-Team / stable-baselines3-contrib

[feature request] DQN Clipped and DQN Reg Algorithms #22