MushroomRL / mushroom-rl

Python library for Reinforcement Learning.
MIT License

PPO for lunar lander [BUG] #124

Closed davidenitti closed 1 year ago

davidenitti commented 1 year ago

I'm trying to use PPO for the lunar lander, but I can't find any examples and my code doesn't seem to converge. Can you spot the issue? Is some parameter wrong?

from mushroom_rl.policy import BoltzmannTorchPolicy

alg = PPO
policy_params = dict(
    std_0=1.,
    n_features=32,
    use_cuda=torch.cuda.is_available()
)
algorithm_params = dict(
    batch_size=128,
    actor_optimizer=optimizer,
    n_epochs_policy=4,
    eps_ppo=.2, lam=.95,
    critic_params=dict(network=net,
                       optimizer=optimizer,
                       loss=F.mse_loss,
                       n_features=32,
                       batch_size=128,
                       input_shape=mdp.info.observation_space.shape,
                       output_shape=(1,))
    )
beta = Parameter(1e0)
policy = BoltzmannTorchPolicy(net, mdp.info.observation_space.shape,
                              mdp.info.action_space.shape,
                              beta, **policy_params)

agent = alg(mdp.info, policy, **algorithm_params)
[...]
core.learn(n_steps=10000, n_steps_per_fit=3000, quiet=args.quiet)

boris-il-forte commented 1 year ago

Probably n_epochs_policy is too low; set it to at least 10. Beta equal to 1 may also be an issue: you can change it to regulate how explorative the policy should be. Furthermore, I don't understand why you have a standard deviation (std_0) in your policy parameters. That shouldn't be the case, as the policy should be built on logits.
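
For intuition, here is a minimal self-contained sketch (not the library internals) of how a Boltzmann policy turns the network's logits into action probabilities, and how beta regulates exploration:

import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # network output (logits) for one state

for beta in (0.1, 1.0, 10.0):
    probs = torch.softmax(beta * logits, dim=-1)
    print(beta, probs)

# small beta -> near-uniform probabilities (more exploration)
# large beta -> almost greedy w.r.t. the logits (less exploration)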

In any case, we don't have any direct experience with PPO and lunar lander discrete, so we cannot provide major help.

If you believe there is a bug in the implementation, let us know and provide a minimal example to reproduce.

davidenitti commented 1 year ago

Thanks, I probably kept the wrong parameters from the only PPO example I found. What exactly is n_epochs_policy, and how does it relate to the n_steps_per_fit option? I read "number of policy updates for every dataset", but I'm not sure what the dataset is in this context.
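
A rough schematic of how the two options seem to interact (this is not the actual Core/PPO source, just a sketch: every n_steps_per_fit collected transitions form the "dataset", and PPO makes n_epochs_policy passes of minibatch updates over it):

import random

def collect_steps(n):
    # stand-in for environment rollouts: n transitions form one "dataset"
    return [random.random() for _ in range(n)]

def ppo_fit(dataset, n_epochs_policy=10, batch_size=128):
    for _ in range(n_epochs_policy):              # several passes over the same data
        random.shuffle(dataset)
        for i in range(0, len(dataset), batch_size):
            minibatch = dataset[i:i + batch_size]
            # actor/critic update on `minibatch` would happen here
            pass

def learn(n_steps=10000, n_steps_per_fit=3000):
    collected = 0
    while collected < n_steps:
        dataset = collect_steps(n_steps_per_fit)  # collect, then fit
        ppo_fit(dataset)
        collected += n_steps_per_fit

learn()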

davidenitti commented 1 year ago

I think the mistake is the number of actions in the line:

BoltzmannTorchPolicy(net, mdp.info.observation_space.shape,
                     mdp.info.action_space.shape,
                     beta, **policy_params)

I passed the shape by mistake instead of the number of actions.
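
For reference, the corrected construction (same names as in the snippet above; it also appears in the full script below) passes the number of discrete actions instead:

policy = BoltzmannTorchPolicy(net, mdp.info.observation_space.shape,
                              (mdp.info.action_space.n,),
                              beta, **policy_params)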

boris-il-forte commented 1 year ago

Yes, definitely this could be a mistake. I overlooked it, as I'm not used to working with PPO with discrete actions. I hope this solves the issue.

davidenitti commented 1 year ago

I found out that there is a bug: the shape of old_log_p in ppo.py is (2048, 1, 2048) instead of (2048, 1), or something like that. I fixed it by changing

old_log_p = old_pol_dist.log_prob(act)[:, None].detach()

to this

old_log_p = old_pol_dist.log_prob(act[:,0])[:, None].detach()
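
A small self-contained reproduction of the shape problem with torch.distributions.Categorical (sizes chosen to match the numbers above):

import torch
from torch.distributions import Categorical

dist = Categorical(logits=torch.randn(2048, 4))  # batch of 2048 states, 4 actions
act = torch.randint(0, 4, (2048, 1))             # actions stored as (N, 1) integers

print(dist.log_prob(act)[:, None].shape)         # torch.Size([2048, 1, 2048]): broadcast
print(dist.log_prob(act[:, 0])[:, None].shape)   # torch.Size([2048, 1]): intended
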
boris-il-forte commented 1 year ago

Hi, I think this may be an issue due to the finite action setting, as PPO works as expected for continuous actions. I guess we are missing some corner cases when the action is an integer. Can you give me your example file such that I can reproduce the error and fix it?

davidenitti commented 1 year ago

import argparse
import os
import numpy as np
from mushroom_rl.utils.dataset import compute_metrics
from mushroom_rl.algorithms.actor_critic import PPO
from mushroom_rl.core import Core, Logger
from mushroom_rl.environments import *
from mushroom_rl.utils.parameters import Parameter
import networks
import losses

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from functools import partial

class NetworkLeakyDense(nn.Module):

    def __init__(self, input_shape, output_shape, **kwargs):
        super().__init__()
        n_input = input_shape[0]
        n_output = output_shape[0]
        act = kwargs.get("act","lrelu")
        if act == 'lrelu':
            self.act = partial(F.leaky_relu, negative_slope=0.1)
        elif act == 'tanh':
            self.act = F.tanh
        elif act == 'relu':
            self.act = F.relu
        else:
            raise NotImplementedError
        self.n_features = kwargs.get("n_features", 256)
        if self.n_features is None:
            self.n_features = 256
        self.extra_layer = kwargs.get('extra_layer', False)
        self._h1 = nn.Linear(n_input, self.n_features)
        if self.extra_layer:
            self._h2 = nn.Linear(self.n_features, self.n_features)
            self._h3 = nn.Linear(self.n_features, n_output)
            # nn.init.xavier_uniform_(self._h2.weight,
            #                         gain=nn.init.calculate_gain('leaky_relu', 0.1))
            # nn.init.xavier_uniform_(self._h3.weight,
            #                     gain=nn.init.calculate_gain('linear'))
        else:
            self._h2 = nn.Linear(self.n_features, n_output)
        #     nn.init.xavier_uniform_(self._h2.weight,
        #                         gain=nn.init.calculate_gain('linear'))
        # nn.init.xavier_uniform_(self._h1.weight,
        #                         gain=nn.init.calculate_gain('leaky_relu', 0.1))

    def forward(self, state, action=None):
        h = self.act(self._h1(state.float()))
        if hasattr(self,'extra_layer') and self.extra_layer:
            h = self.act(self._h2(h))
            q = self._h3(h)
        else:
            q = self._h2(h)

        if action is None:
            return q
        else:
            q_acted = torch.squeeze(q.gather(1, action.long()))
            return q_acted

def print_epoch(epoch, logger):
    logger.info('################################################################')
    logger.info('Epoch: %d' % epoch)
    logger.info('----------------------------------------------------------------')
###
def get_stats(dataset, logger, epoch=None):
    score = compute_metrics(dataset)
    if len(score) == 4:
        logger.info(('min_reward: %f, max_reward: %f, mean_reward: %f,'
                     ' games_completed: %d' % score))
        return {'min': score[0], 'max': score[1], 'mean': score[2],
                'games': score[3], 'epoch': epoch}
    else:
        logger.info(('min_reward: %f, max_reward: %f, mean_reward: %f,'
                     ' median_reward: %f, games_completed: %d' % score))
        return {'min': score[0], 'max': score[1], 'mean': score[2], 'median': score[3],
                'games': score[4], 'epoch': epoch}
def get_args(args_list=None):
    # Argument parser
    parser = argparse.ArgumentParser()

    arg_game = parser.add_argument_group('Game')
    arg_game.add_argument("--env",
                          type=str,
                          default='LunarLander-v2',
                          help='Gym ID')
    arg_game.add_argument("--network", type=str, default='NetworkLeakyDense')

    arg_game.add_argument("--loss", type=str, default="mse_loss")

    arg_game.add_argument("--clip_grad", type=float, default=None,
                          help='clip gradient by norm')

    arg_game.add_argument("--screen-width", type=int, default=84,
                          help='Width of the game screen.')
    arg_game.add_argument("--screen-height", type=int, default=84,
                          help='Height of the game screen.')

    arg_mem = parser.add_argument_group('Replay Memory')

    arg_net = parser.add_argument_group('Deep Q-Network')
    arg_net.add_argument("--optimizer",
                         choices=['adadelta',
                                  'adam',
                                  'rmsprop',
                                  'rmspropcentered'],
                         default='adam',
                         help='Name of the optimizer to use.')
    arg_net.add_argument("--n_features", type=int, default=None)
    arg_net.add_argument("--extra_layer", action='store_true', default=True)

    arg_net.add_argument("--learning_rate", type=float, default=.0001,
                         help='Learning rate value of the optimizer.')
    arg_net.add_argument("--decay", type=float, default=.95,
                         help='Discount factor for the history coming from the'
                              'gradient momentum in rmspropcentered and'
                              'rmsprop')
    arg_net.add_argument("--momentum", type=float, default=0.,
                         help='momentum rmsprop')
    arg_net.add_argument("--epsilon", type=float, default=1e-4,
                         help='Epsilon term used in rmspropcentered and'
                              'rmsprop')
    arg_net.add_argument("--num_features", type=int, default=64, help='num features')

    arg_alg = parser.add_argument_group('Algorithm')
    arg_alg.add_argument("--algorithm", choices=['DQN', 'DQNPred', 'DoubleDQN', 'adqn', 'mmdqn',
                                                 'cdqn', 'dueldqn', 'ndqn', 'qdqn', 'rainbow', "PPO"],
                         default='PPO',
                         help='Name of the algorithm. dqn is for standard'
                              'DQN, ddqn is for Double DQN and adqn is for'
                              'Averaged DQN.')
    arg_alg.add_argument("--n_approximators", type=int, default=1,
                         help="Number of approximators used in the ensemble for"
                              "AveragedDQN or MaxminDQN.")
    arg_alg.add_argument("--batch_size", type=int, default=32,
                         help='Batch size for each fit of the network.')
    arg_alg.add_argument("--history-length", type=int, default=4,
                         help='Number of frames composing a state.')
    arg_alg.add_argument("--target_update_frequency", type=int, default=10000,
                         help='Number of collected samples before each update'
                              'of the target network.')
    arg_alg.add_argument("--evaluation_frequency", type=int, default=10000,
                         help='Number of collected samples before each'
                              'evaluation. An epoch ends after this number of'
                              'steps')
    arg_mem.add_argument("--lam", type=float, default=.95, help=''),

    arg_alg.add_argument("--train_frequency", type=int, default=2048,
                         help='Number of collected samples before each fit of'
                              'the neural network.')
    arg_alg.add_argument("--max_steps", type=int, default=27500000,
                         help='Total number of collected samples.')
    arg_alg.add_argument("--final_exploration_frame", type=int, default=1000000,
                         help='Number of collected samples until the exploration'
                              'rate stops decreasing.')
    arg_alg.add_argument("--initial-exploration-rate", type=float, default=1.,
                         help='Initial value of the exploration rate.')
    arg_alg.add_argument("--final_exploration_rate", type=float, default=.1,
                         help='Final value of the exploration rate. When it'
                              'reaches this values, it stays constant.')
    arg_alg.add_argument("--test_exploration_rate", type=float, default=.05,
                         help='Exploration rate used during evaluation.')
    arg_alg.add_argument("--test-samples", type=int, default=125000,
                         help='Number of collected samples for each'
                              'evaluation.')
    arg_alg.add_argument("--max-no-op-actions", type=int, default=30,
                         help='Maximum number of no-op actions performed at the'
                              'beginning of the episodes.')
    arg_alg.add_argument("--alpha_coeff", type=float, default=.6,
                         help='Prioritization exponent for prioritized experience replay.')
    arg_alg.add_argument("--n-atoms", type=int, default=51,
                         help='Number of atoms for Categorical DQN.')
    arg_alg.add_argument("--v-min", type=int, default=-10,
                         help='Minimum action-value for Categorical DQN.')
    arg_alg.add_argument("--v-max", type=int, default=10,
                         help='Maximum action-value for Categorical DQN.')
    arg_alg.add_argument("--n-quantiles", type=int, default=200,
                         help='Number of quantiles for Quantile Regression DQN.')
    arg_alg.add_argument("--n_steps_return", type=int, default=1,
                         help='Number of steps for n-step return for Rainbow.')
    arg_alg.add_argument("--sigma-coeff", type=float, default=.5,
                         help='Sigma0 coefficient for noise initialization in'
                              'NoisyDQN and Rainbow.')

    arg_utils = parser.add_argument_group('Utils')

    arg_utils.add_argument('--episode_mode', action='store_true',
                           help='Flag episode mode')

    arg_utils.add_argument('--use_cpu', action='store_false', dest='use_cuda',
                           help='Flag specifying whether to use the GPU.')
    arg_utils.add_argument('--disable_save', action='store_false', dest="save",
                           help='Flag specifying whether to save the model.')
    arg_utils.add_argument('--load-path', type=str,
                           help='Path of the model to be loaded.')
    arg_utils.add_argument('--render', action='store_true',
                           help='Flag specifying whether to render the game.')
    arg_utils.add_argument('--quiet', action='store_true',
                           help='Flag specifying whether to hide the progress'
                                'bar.')
    arg_utils.add_argument('--debug', action='store_true',
                           help='Flag specifying whether the script has to be'
                                'run in debug mode.')
    args = parser.parse_args(args_list)
    return args

def main(args_list=None, callback=None, upload_checkpoint=False):
    args = get_args(args_list)
    print(args)
    np.random.seed()

    scores = list()

    optimizer = dict()
    if args.optimizer == 'adam':
        optimizer['class'] = optim.Adam
        optimizer['params'] = dict(lr=args.learning_rate,
                                   eps=args.epsilon)
    elif args.optimizer == 'adadelta':
        optimizer['class'] = optim.Adadelta
        optimizer['params'] = dict(lr=args.learning_rate,
                                   eps=args.epsilon)
    elif args.optimizer == 'rmsprop':
        optimizer['class'] = optim.RMSprop
        optimizer['params'] = dict(lr=args.learning_rate,
                                   alpha=args.decay,
                                   eps=args.epsilon,
                                   momentum=args.momentum)
    elif args.optimizer == 'rmspropcentered':
        optimizer['class'] = optim.RMSprop
        optimizer['params'] = dict(lr=args.learning_rate,
                                   alpha=args.decay,
                                   eps=args.epsilon,
                                   centered=True)
    else:
        raise ValueError

    # Settings
    if args.debug:
        train_frequency = 5
        evaluation_frequency = 50
        max_steps = 1000
    else:
        train_frequency = args.train_frequency
        evaluation_frequency = args.evaluation_frequency
        max_steps = args.max_steps

    # MDP
    mdp = Gym(args.env, 1000)

    net = getattr(networks, args.network)

    if args.algorithm == 'PPO':
        alg = PPO
        from mushroom_rl.policy import BoltzmannTorchPolicy
        policy_params = dict(
            n_features=args.n_features,
            extra_layer=args.extra_layer,
            act="tanh",
            use_cuda=torch.cuda.is_available()
        )
        algorithm_params = dict(
            batch_size=args.batch_size,
            actor_optimizer=optimizer,
            n_epochs_policy=10,
            eps_ppo=.2, lam=args.lam,
            critic_params=dict(network=net,
                               optimizer=optimizer,
                               loss=F.mse_loss,
                               n_features=args.n_features,
                               extra_layer=args.extra_layer,
                               act="lrelu",
                               batch_size=args.batch_size,
                               input_shape=mdp.info.observation_space.shape,
                               output_shape=(1,))
            )
        beta = Parameter(1.)
        policy = BoltzmannTorchPolicy(net, mdp.info.observation_space.shape,
                                      (mdp.info.action_space.n,),
                                      beta, **policy_params)

        agent = alg(mdp.info, policy, **algorithm_params)

    assert args.train_frequency >= args.n_steps_return
    logger = Logger(alg.__name__)
    logger.strong_line()
    logger.info('Experiment Algorithm: ' + alg.__name__)
    core = Core(agent, mdp)
    if len(scores) > 0:
        start_epoch = scores[-1]['epoch']
    else:
        start_epoch = 0

    for n_epoch in range(start_epoch + 1, max_steps // evaluation_frequency + 1):
        print_epoch(n_epoch, logger)
        logger.info('- Learning:')
        core.learn(n_steps=evaluation_frequency,
                       n_steps_per_fit=train_frequency, quiet=args.quiet)

        if n_epoch % 2 == 0:
            logger.info('- Evaluation:')
            dataset = core.evaluate(n_steps=10000, render=False, quiet=args.quiet)
            get_stats(dataset, logger, epoch=n_epoch)
    return scores

if __name__ == '__main__':
    main(None)

boris-il-forte commented 1 year ago

Thanks for sharing your example. I'll try to fix the issue as soon as possible. Unfortunately, it might take a bit, as we are currently quite busy. Thanks for your patience and for helping us make MushroomRL better.

davidenitti commented 1 year ago

No problem. As I said, I already fixed it as described above, but I'm just not sure whether the fix will work with continuous actions or other cases.

boris-il-forte commented 1 year ago

@davidenitti the last commit in dev should fix the issue. The fix is very similar to the one you proposed, but it directly wraps the distribution and changes the behavior of log_prob, so that the fix only affects the Categorical distribution and not the others (e.g., the Gaussian one).
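
For readers landing here, a minimal sketch of what such a wrapper can look like (an illustration, not necessarily the exact code in the dev branch):

import torch
from torch.distributions import Categorical

class CategoricalWrapper(Categorical):
    # log_prob drops the trailing action dimension, so (N, 1) integer
    # action tensors give (N,) log-probabilities instead of a broadcast (N, N)
    def log_prob(self, value):
        return super().log_prob(value.squeeze(-1))

dist = CategoricalWrapper(logits=torch.randn(2048, 4))
act = torch.randint(0, 4, (2048, 1))
print(dist.log_prob(act)[:, None].shape)  # torch.Size([2048, 1])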

Thank you very much for this bug report, and sorry for the extremely slow response time. Unfortunately, we are pretty busy at the moment. Feel free to re-open this bug report if you find further issues with the current solution.