Hi, you need to make sure your model implements the chainerrl.recurrent.Recurrent interface so that it can be treated as a recurrent model. I guess the easiest way to do it is inheriting chainerrl.recurrent.RecurrentChainMixin, like

class QFunctionRecurrent(chainer.Chain, StateQFunction, RecurrentChainMixin):

which will find L.LSTM by searching recursively through chainer.Chain and chainer.ChainList.
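For example, a minimal recurrent Q-function could look like the sketch below (untested; the layer names and sizes are just placeholders, not taken from your code):

import chainer
import chainer.functions as F
import chainer.links as L
from chainerrl.action_value import DiscreteActionValue
from chainerrl.q_function import StateQFunction
from chainerrl.recurrent import RecurrentChainMixin


class QFunctionRecurrent(chainer.Chain, StateQFunction, RecurrentChainMixin):

    def __init__(self, obs_size, n_actions, n_hidden=64):
        super().__init__(
            fc=L.Linear(obs_size, n_hidden),
            lstm=L.LSTM(n_hidden, n_hidden),  # found by RecurrentChainMixin
            out=L.Linear(n_hidden, n_actions),
        )

    def __call__(self, x, test=False):
        h = F.relu(self.fc(x))
        h = self.lstm(h)  # the LSTM keeps its internal state between calls
        return DiscreteActionValue(self.out(h))

With this, the agent can manage the recurrent state (e.g. reset_state) through the Recurrent interface.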
Documentation on the usage of recurrent models is largely missing, so I opened another issue for it: #83. Thanks for reporting the issue!
Hi, thank you very much for your fast response! That indeed solved the issue.
Just for clarification:
If episodic_replay=True, then:
minibatch_size corresponds to the number of episodes used for the experience replay
and
episodic_update_len corresponds to the number of time steps used within each of those episodes, right?
Thus, if one episode has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?
Thanks again!
Unfortunately, my above statement about the issue being solved was a bit hasty.
I modified the Q-function, i.e.:
class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias),
        )
        lstm_layer = L.LSTM(128, 128)
        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])
        super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lstm_layer(h)
        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean
        ys = self.v_stream(h, test=test)
        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)
with
episodic_replay = True
minibatch_size = 4
episodic_update_len = None
but still, the average_loss is 0, whereas it isn't in the non-recurrent case.
However, there is a good chance that this is due to some bug in my code.
Below I have attached the full source code (without the environment), based on the train_dqn_gym.py you have provided.
Thanks again!
def main():
    import logging
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--outdir', type=str, default='dqn_out')
    parser.add_argument('--env', type=str, default='Pendulum-v0')
    parser.add_argument('--seed', type=int, default=None)
    parser.add_argument('--gpu', type=int, default=0)
    parser.add_argument('--final-exploration-steps',
                        type=int, default=1000 * 50)
    parser.add_argument('--start-epsilon', type=float, default=1.0)
    parser.add_argument('--end-epsilon', type=float, default=.05)
    parser.add_argument('--demo', action='store_true', default=False)
    parser.add_argument('--load', type=str, default=None)
    parser.add_argument('--steps', type=int, default=500000)
    parser.add_argument('--prioritized-replay', action='store_true')
    parser.add_argument('--episodic-replay', type=bool, default=True)
    parser.add_argument('--replay-start-size', type=int, default=None)
    parser.add_argument('--target-update-frequency', type=int, default=1)
    parser.add_argument('--target-update-method', type=str, default='soft')
    parser.add_argument('--soft-update-tau', type=float, default=0.001)
    parser.add_argument('--update-frequency', type=int, default=1)
    parser.add_argument('--eval-n-runs', type=int, default=10)
    parser.add_argument('--eval-frequency', type=int, default=50 * 10)
    parser.add_argument('--n-hidden-channels', type=int, default=100)
    parser.add_argument('--n-hidden-layers', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--minibatch-size', type=int, default=None)
    parser.add_argument('--render-train', action='store_true')
    parser.add_argument('--render-eval', action='store_true')
    parser.add_argument('--monitor', action='store_true')
    parser.add_argument('--reward-scale-factor', type=float, default=.1)
    args = parser.parse_args()

    args.outdir = experiments.prepare_output_dir(
        args, args.outdir, argv=sys.argv)
    print('Output files are saved in {}'.format(args.outdir))

    if args.seed is not None:
        misc.set_random_seed(args.seed)

    def clip_action_filter(a):
        return np.clip(a, action_space.low, action_space.high)

    def make_env(for_eval):
        env = gym.make(args.env)
        if args.monitor:
            env = gym.wrappers.Monitor(env, args.outdir)
        if isinstance(env.action_space, spaces.Box):
            misc.env_modifiers.make_action_filtered(env, clip_action_filter)
        if not for_eval:
            misc.env_modifiers.make_reward_filtered(
                env, lambda x: x * args.reward_scale_factor)
        if ((args.render_eval and for_eval) or
                (args.render_train and not for_eval)):
            misc.env_modifiers.make_rendered(env)
        return env

    env = make_env(for_eval=False)
    timestep_limit = env.spec.tags.get(
        'wrapper_config.TimeLimit.max_episode_steps')
    obs_size = env.observation_space.low.size
    action_space = env.action_space
    n_actions = action_space.n

    class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

        def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
            self.n_actions = n_actions
            self.n_input_channels = n_input_channels
            conv_layers = chainer.ChainList(
                L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
                L.Convolution2D(32, 64, 4, stride=2, bias=bias),
                L.Convolution2D(64, 64, 3, stride=1, bias=bias),
                L.Convolution2D(64, 128, 7, stride=1, bias=bias),
            )
            lstm_layer = L.LSTM(128, 128)
            a_stream = MLP(128, n_actions, [2])
            v_stream = MLP(128, 1, [2])
            super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                             a_stream=a_stream, v_stream=v_stream)

        def __call__(self, x, test=False):
            """
            Args:
                x (ndarray or chainer.Variable): An observation
                test (bool): a flag indicating whether it is in test mode
            """
            h = x
            for l in self.conv_layers:
                h = F.relu(l(h))
            h = self.lstm_layer(h)
            batch_size = x.shape[0]
            ya = self.a_stream(h, test=test)
            mean = F.reshape(
                F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
            ya, mean = F.broadcast(ya, mean)
            ya -= mean
            ys = self.v_stream(h, test=test)
            ya, ys = F.broadcast(ya, ys)
            q = ya + ys
            return chainerrl.action_value.DiscreteActionValue(q)

    explorer = explorers.LinearDecayEpsilonGreedy(
        args.start_epsilon, args.end_epsilon, args.final_exploration_steps,
        action_space.sample)

    q_func = QFunction(3, 4)
    opt = optimizers.Adam()
    opt.setup(q_func)

    rbuf_capacity = 100000
    if args.episodic_replay:
        print('episodic replay')
        if args.minibatch_size is None:
            args.minibatch_size = 4
        if args.replay_start_size is None:
            args.replay_start_size = 10
        if args.prioritized_replay:
            betasteps = \
                (args.steps - timestep_limit * args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedEpisodicReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.EpisodicReplayBuffer(rbuf_capacity)
    else:
        if args.minibatch_size is None:
            args.minibatch_size = 32
        if args.replay_start_size is None:
            args.replay_start_size = 1000
        if args.prioritized_replay:
            betasteps = (args.steps - args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.ReplayBuffer(rbuf_capacity)

    def phi(obs):
        # Reorder axes and scale pixel values to [0, 1]
        return (np.swapaxes(obs, 0, 2).astype(np.float32)) / 255.

    gym.undo_logger_setup()  # Turn off gym's default logger settings
    logging.basicConfig(level=logging.DEBUG, stream=sys.stdout, format='')

    agent = DoubleDQN(q_func, opt, rbuf, gpu=args.gpu, gamma=args.gamma,
                      explorer=explorer,
                      replay_start_size=args.replay_start_size,
                      target_update_interval=args.target_update_frequency,
                      update_interval=args.update_frequency,
                      phi=phi, minibatch_size=args.minibatch_size,
                      target_update_method=args.target_update_method,
                      soft_update_tau=args.soft_update_tau,
                      episodic_update=args.episodic_replay,
                      episodic_update_len=None)

    if args.load:
        agent.load(args.load)

    eval_env = make_env(for_eval=True)

    if args.demo:
        mean, median, stdev = experiments.eval_performance(
            env=eval_env,
            agent=agent,
            n_runs=args.eval_n_runs,
            max_episode_len=50)
        print('n_runs: {} mean: {} median: {} stdev: {}'.format(
            args.eval_n_runs, mean, median, stdev))
    else:
        experiments.train_agent_with_evaluation(
            agent=agent, env=env, steps=args.steps,
            eval_n_runs=args.eval_n_runs, eval_interval=args.eval_frequency,
            outdir=args.outdir, eval_env=eval_env,
            max_episode_len=50)


if __name__ == '__main__':
    main()
@kfeeeeee Can you give me the complete code (including the import statements) and the command line arguments?
@muupan Sure thing.
Gym-Environment: gridworld.py
Train-Script: train_dqn_gym.py
And command execution:
python train_dqn_gym.py --env 'Gridworld-v0'
For the rest I use the default args defined in train_dqn_gym.py
Thanks again for looking into this!
Thanks for your code!
Just for clarification:
If episodic_replay=True, then:
minibatch_size corresponds to the number of episodes used for the experience replay
and
episodic_update_len corresponds to the number of time steps used within each of those episodes, right?
Thus, if one episode has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?
You are correct. minibatch_size is the number of episodes to sample for an update. Each sampled episode's length is at most episodic_update_len.
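As an illustrative sketch (not your exact script), settings like the following would sample 4 episodes per update and use at most 16 consecutive time steps from each of them:

agent = DoubleDQN(
    q_func, opt, rbuf,
    gamma=0.99,
    explorer=explorer,
    minibatch_size=4,        # number of episodes sampled per update
    episodic_update=True,    # sample (sub-)episodes from the replay buffer
    episodic_update_len=16,  # use at most 16 consecutive steps per episode
)

With episodic_update_len=None, each sampled episode is used in full.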
As for average_loss, it turned out to be a bug in ChainerRL. Losses are computed and the model is updated as usual. However, the value of average_loss is not updated at all when episodic_update=True. I'll open an issue for it and fix it soon. Thanks for reporting it!
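Until the fix is in, one rough way to convince yourself that the model is still being updated (despite the reported average_loss of 0) is to snapshot a parameter and compare it after some training steps. A sketch, using the lstm_layer name from your code:

import numpy as np
from chainer import cuda

# Copy one LSTM weight before training, then compare after some updates
before = np.copy(cuda.to_cpu(q_func.lstm_layer.upward.W.data))
# ... let the agent act and update for a while ...
after = cuda.to_cpu(q_func.lstm_layer.upward.W.data)
print('parameters changed:', not np.allclose(before, after))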
Thank you very much for your effort!
Trying these two different Q-functions:
(non-recurrent)
(recurrent)
I found that for the non-recurrent version the loss is not zero and the agent will eventually master the provided gym environment.
However, after changing nothing other than adding an LSTM layer and setting episodic_replay to True, the average_loss becomes 0 all the time and the agent is not able to learn to interact better with its environment.
First, I thought that this was due to some kind of rounding issue, so I set minibatch_size=1 and episodic_update_len=1 (assuming that one episodic replay would now only contain one time step), but still no change.
I wonder if this is some kind of bug or (which I think is more likely) an error on my side.
Any help is very much appreciated!