astooke / rlpyt

Reinforcement Learning in PyTorch
MIT License

DQN in CartPole-v0 doesn't learn #135

Closed kargarisaac closed 4 years ago

kargarisaac commented 4 years ago

I'm trying to use rlpyt with my custom env, which has a non-image input state. To start, I want to test on a simple env like CartPole-v0, using DQN and DqnAgent. But I get this error:

Traceback (most recent call last):
  File "/home/isaac/codes/dd-zero/rlpyt/examples/example_9.py", line 178, in <module>
    resume_chkpnt=None,
  File "/home/isaac/codes/dd-zero/rlpyt/examples/example_9.py", line 125, in build_and_train
    runner.train()
  File "/home/isaac/codes/dd-zero/rlpyt/rlpyt/runners/minibatch_rl.py", line 252, in train
    n_itr = self.startup()
  File "/home/isaac/codes/dd-zero/rlpyt/rlpyt/runners/minibatch_rl.py", line 81, in startup
    world_size=world_size,
  File "/home/isaac/codes/dd-zero/rlpyt/rlpyt/samplers/serial/sampler.py", line 47, in initialize
    global_B=global_B, env_ranks=env_ranks)
  File "/home/isaac/codes/dd-zero/rlpyt/rlpyt/agents/dqn/dqn_agent.py", line 37, in initialize
    global_B=global_B, env_ranks=env_ranks)
  File "/home/isaac/codes/dd-zero/rlpyt/rlpyt/agents/base.py", line 91, in initialize
    **self.model_kwargs)
TypeError: 'NoneType' object is not callable

The code is:

from rlpyt.samplers.serial.sampler import SerialSampler
from rlpyt.envs.gym import gym_make
from rlpyt.algos.dqn.dqn import DQN
from rlpyt.agents.dqn.dqn_agent import DqnAgent
from rlpyt.replays.non_sequence.uniform import UniformReplayBuffer
from rlpyt.runners.minibatch_rl import MinibatchRl
from rlpyt.utils.logging.context import logger_context


def build_and_train(run_ID=0, cuda_idx=None, resume_chkpnt=None):
    env_id = 'CartPole-v0'
    sampler = SerialSampler(
        EnvCls=gym_make,
        env_kwargs=dict(id=env_id),
        eval_env_kwargs=dict(id=env_id),
        batch_T=4,  # Four time-steps per sampler iteration.
        batch_B=8,  # Eight parallel environments.
        max_decorrelation_steps=100,
        eval_n_envs=10,
        eval_max_steps=int(50e3),
        eval_max_trajectories=50,
    )
    algo = DQN(
        min_steps_learn=int(1e2),
        replay_size=int(1e4),
        replay_ratio=32,  # data_consumption / data_generation.
        learning_rate=0.01,
        double_dqn=True,
        ReplayBufferCls=UniformReplayBuffer,  # Leave None to select by above options.
    )
    agent = DqnAgent()
    runner = MinibatchRl(
        algo=algo,
        agent=agent,
        sampler=sampler,
        n_steps=1e6,
        log_interval_steps=1e3,
        affinity=dict(cuda_idx=cuda_idx, workers_cpus=[0,1,2,3,4,5,6])
    )
    config = dict(env_id=env_id)
    algo_name = 'dqn_'
    name = algo_name + env_id
    log_dir = algo_name + "cartpole"
    with logger_context(log_dir, run_ID, name, config, snapshot_mode='last'):
        runner.train()

But ModelCls is None in DqnAgent, and I think that's the reason for the error. So I wrote a model and an agent like the following and used them instead of DqnAgent:

import torch

from rlpyt.agents.base import AgentStep
from rlpyt.agents.dqn.dqn_agent import DqnAgent
from rlpyt.models.dqn.dueling import DuelingHeadModel
from rlpyt.models.mlp import MlpModel
from rlpyt.utils.buffer import buffer_to
from rlpyt.utils.collections import namedarraytuple
from rlpyt.utils.tensor import infer_leading_dims, restore_leading_dims

AgentInfo = namedarraytuple("AgentInfo", "q")


class CustomDqnModel(torch.nn.Module):
    """MLP Q-network for flat (non-image) observations."""

    def __init__(
            self,
            observation_shape,
            output_size,
            fc_sizes=128,
            dueling=False,
            ):
        """Instantiates the neural network according to arguments; network
        defaults stored within this method."""
        super().__init__()
        self.dueling = dueling
        self._obs_ndim = len(observation_shape)
        input_size = observation_shape[0]  # Flat observation dimension.
        self.base_net = torch.nn.Sequential(
            torch.nn.Linear(input_size, fc_sizes),
            torch.nn.ReLU(),
            torch.nn.Linear(fc_sizes, fc_sizes),
            torch.nn.ReLU(),
            torch.nn.Linear(fc_sizes, output_size),
        )

    def forward(self, observation, prev_action, prev_reward):
        observation = observation.type(torch.float)
        # Handle leading Time/Batch dims, as rlpyt expects.
        lead_dim, T, B, obs_shape = infer_leading_dims(observation, self._obs_ndim)
        obs = observation.view(T * B, -1)
        q = self.base_net(obs)
        q = restore_leading_dims(q, lead_dim, T, B)
        return q


class CustomMixin:
    def make_env_to_model_kwargs(self, env_spaces):
        return dict(observation_shape=env_spaces.observation.shape,
                    output_size=env_spaces.action.n)


class CustomDqnAgent(CustomMixin, DqnAgent):
    def __init__(self, ModelCls=CustomDqnModel, **kwargs):
        super().__init__(ModelCls=ModelCls, **kwargs)
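
To use it, the only change I made in build_and_train above is the agent line:

agent = CustomDqnAgent()  # in place of agent = DqnAgent()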

It runs fine, but it doesn't learn.

[attached plot: training curves showing no improvement]

The plot is similar even after 20,000,000 steps. I checked the code several times and tested different configs for several days.

Do you have any idea to solve this problem?

Update: CartPole-v0 has a discrete action space with two actions. DDPG and SAC work fine for my custom env, which has a continuous action space, and I am trying to discretize its action space. I trained the discretized version using DQN from Stable Baselines and with my own pure PyTorch implementation, and it works. But I couldn't train it using rlpyt, so I decided to first try CartPole-v0. Do you see any problem in my code for an env with a discrete action space?

astooke commented 4 years ago

OK, thanks for the update and clarifications!

The one thing I can think of that's missing from your configuration has to do with the epsilon-greedy schedule. Look in the EpsilonGreedyAgent (a base class for DqnAgent): https://github.com/astooke/rlpyt/blob/master/rlpyt/agents/dqn/epsilon_greedy.py and check whether you are providing a good schedule for epsilon, with good starting and ending values.
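
For example, something like this (the values are only illustrative, and double-check the kwarg names against that file):

agent = CustomDqnAgent(
    eps_init=1.0,       # start fully random
    eps_final=0.01,     # exploration rate after annealing
    eps_itr_min=500,    # annealing happens roughly between these
    eps_itr_max=5000,   #   two sampler iterations
    eps_eval=0.001,     # epsilon used during offline evaluation
)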

You'll probably also want to increase min_steps_learn to something like 1e3 or 1e4, to populate the replay buffer with lots of random samples before starting to learn. Or whatever setting you used in your other implementation?
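
For example, keeping the other DQN kwargs and imports from the script above:

algo = DQN(
    min_steps_learn=int(1e3),  # or 1e4, instead of int(1e2)
    replay_size=int(1e4),
    replay_ratio=32,
    learning_rate=0.01,
    double_dqn=True,
    ReplayBufferCls=UniformReplayBuffer,
)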

Another thing would be to use MinibatchRlEval instead of MinibatchRl for the runner. Only the "eval" one will pause training to run the agent with a different value for epsilon and report those scores.
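
The swap would look like this (same arguments as your MinibatchRl call); your sampler already has the eval_* settings it needs:

from rlpyt.runners.minibatch_rl import MinibatchRlEval

runner = MinibatchRlEval(
    algo=algo,
    agent=agent,
    sampler=sampler,
    n_steps=1e6,
    log_interval_steps=1e3,
    affinity=dict(cuda_idx=cuda_idx, workers_cpus=[0, 1, 2, 3, 4, 5, 6]),
)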

Let us know if any of those help!

This is interesting; I haven't actually run CartPole myself, so it would be good to see what settings work.

DanielTakeshi commented 4 years ago

FWIW, from my experience with CartPole, I'm not actually sure DQN does well on it. Strangely, DQN seems to be more reliable on Pong than on CartPole, though I might not have settled on ideal hyperparameters. I usually verify DQN code by running it on Pong.

kargarisaac commented 4 years ago

I tried several configurations for CartPole, but it didn't learn. Finally, I decided to test Pong using a custom agent and model, not the ones from rlpyt, to see if my code is wrong. I just used the resized RGB image as input and the same configuration for the sampler, algo, and runner, but again there was no sign of learning. It seems my code has a problem that I cannot find. Here is my code. I would be grateful if you could take a look at it.

kargarisaac commented 4 years ago

Finally, the problem is solved. The default replay buffer selection in DQN is not correct for non-frame environments: it picks the frame-based buffer classes. When I set ReplayBufferCls to UniformReplayBuffer explicitly, it works perfectly. I will clean up the code, add an example, and make a pull request.
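
Roughly, the fix is to pass the non-frame uniform buffer explicitly instead of letting DQN pick the frame-based one (double-check the import path against your rlpyt version):

from rlpyt.algos.dqn.dqn import DQN
from rlpyt.replays.non_sequence.uniform import UniformReplayBuffer

algo = DQN(
    min_steps_learn=int(1e3),
    replay_size=int(1e4),
    replay_ratio=32,
    learning_rate=0.01,
    double_dqn=True,
    ReplayBufferCls=UniformReplayBuffer,  # instead of the default frame-based buffer
)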