cpnota / autonomous-learning-library

A PyTorch library for building deep reinforcement learning agents.
MIT License

Documentation on StochasticPolicy #136

Closed: cisprague closed this issue 4 years ago

cisprague commented 4 years ago

Hi, I am trying to use the PPO algorithm; however, it's not clear how to construct the stochastic policy. Should I use the Gaussian policy network?

Cool library by the way; I like the modularity!

cpnota commented 4 years ago

Hi!

It depends on the action space of your environment. If you have a continuous action space, I do recommend using GaussianPolicy; if you have a discrete action space, you should instead use SoftmaxPolicy. You're right, the Policy classes could use some additional documentation!

You'll see GaussianPolicy takes three required parameters: model, optimizer, and space. model should be any PyTorch model that takes an input shape corresponding to the state/observation space and outputs 2 * the number of action dimensions (one output for the mean and one for the std of each action dimension); the module will automatically normalize these outputs to [space.low, space.high]. optimizer should be any PyTorch optimizer for the model (e.g., Adam), and space should be the gym.Space corresponding to the action space. See the presets for continuous and atari.
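
For concreteness, here is a minimal sketch of a standalone GaussianPolicy for a continuous-control environment (the environment name, layer sizes, and learning rate are placeholders, not recommendations):

import gym
import torch
from all.policies import GaussianPolicy

env = gym.make('Pendulum-v0')  # any continuous-action Gym environment
action_dim = env.action_space.shape[0]

# the model outputs 2 * action_dim values: a mean and a std for each action dimension
model = torch.nn.Sequential(
    torch.nn.Linear(env.observation_space.shape[0], 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2 * action_dim)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
policy = GaussianPolicy(model, optimizer, env.action_space)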

If your environment fits, you might also be able to use the presets directly: from all.presets.continuous import ppo. If not, it may still be useful to use these files as starting points.
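
For example, roughly like this (double-check the exact preset arguments against all/presets/continuous/ppo.py for your version; the device argument and call pattern here are just illustrative):

from all.presets.continuous import ppo
from all.experiments import ParallelEnvExperiment

# the configured preset supplies the agent constructor (and its n_envs setting)
# that the experiment expects
experiment = ParallelEnvExperiment(ppo(device='cpu'), env)
experiment.train(frames=100000)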

Let me know if you need anything else!

cisprague commented 4 years ago

Thanks for the quick response! Shouldn't the input shape to the GaussianPolicy be the shape of the output from the FeatureNetwork?

If I understand correctly, the arguments to PPO should map in the following way:

Is that correct?

This is my code:

    # imports (module paths for the all library may differ slightly by version)
    import torch
    from all.agents import PPO
    from all.approximation import FeatureNetwork, VNetwork
    from all.policies import GaussianPolicy
    from all.experiments import SingleEnvExperiment

    # instantiate Gym environment (Field, Model, AUV, and Deterministic are my own classes)
    field = Field()
    model = Model()
    system = AUV(field, model)
    env = Deterministic(system)

    # feature network
    shape = [env.observation_space.shape[0], 400, 100]
    feature_model = torch.nn.Sequential(*[
        op for i in range(len(shape) - 1) for op in [
            torch.nn.Linear(shape[i], shape[i+1]),
            torch.nn.BatchNorm1d(shape[i+1]),
            torch.nn.ReLU()
        ]
    ])
    feature_optimiser = torch.optim.Adam(
        feature_model.parameters(),
        lr=1e-3
    )
    feature_network = FeatureNetwork(
        feature_model, feature_optimiser
    )

    # value network
    shape = [100, 100, 1]
    value_model = torch.nn.Sequential(*[
        op for i in range(len(shape) - 1) for op in [
            torch.nn.Linear(shape[i], shape[i+1]),
            torch.nn.BatchNorm1d(shape[i+1]),
            torch.nn.ReLU()
        ]
    ])
    value_optimiser = torch.optim.Adam(
        value_model.parameters(),
        lr=1e-3
    )
    value_network = VNetwork(
        value_model, value_optimiser
    )

    # policy network
    shape = [100, 100, 2*env.action_space.shape[0]]
    policy_model = torch.nn.Sequential(*[
        op for i in range(len(shape) - 1) for op in [
            torch.nn.Linear(shape[i], shape[i+1]),
            torch.nn.BatchNorm1d(shape[i+1]),
            torch.nn.ReLU()
        ]
    ])
    policy_optimiser = torch.optim.Adam(
        policy_model.parameters(),
        lr=1e-3
    )
    policy_network = GaussianPolicy(
        policy_model, policy_optimiser, env.action_space
    )

    # proximal policy optimisation agent
    agent = PPO(
        feature_network,
        value_network,
        policy_network,
        n_envs=1,
        n_steps=10000
    )

    # experiment
    SingleEnvExperiment(agent, env)

But, I get the following error:

Traceback (most recent call last):
  File "train.py", line 91, in <module>
    deterministic()
  File "train.py", line 84, in deterministic
    SingleEnvExperiment(agent, env)
  File "/usr/local/lib/python3.6/dist-packages/all/experiments/single_env_experiment.py", line 16, in __init__
    super().__init__(self._make_writer(agent.__name__, env.name, write_loss), quiet)
  File "/usr/local/lib/python3.6/dist-packages/all/optim/scheduler.py", line 6, in __getattribute__
    value = object.__getattribute__(self, name)
AttributeError: 'PPO' object has no attribute '__name__'
cpnota commented 4 years ago

Shouldn't the input shape to the GaussianPolicy be the shape of the output from the FeatureNetwork?

Ah, yes, you're correct. I'm a little skeptical of your network shapes, though; it looks like the last two layers in both the policy and value networks are batchnorm and ReLU layers. Usually, you would want the final layer to be a Linear layer; otherwise the ReLU constrains the outputs to be non-negative, which is a problem for the value estimate and for the policy's mean/std outputs.
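
For example, a corrected value head would look something like this (using the layer sizes from your snippet):

import torch

# value head: 100 features in, a single unbounded value estimate out
value_model = torch.nn.Sequential(
    torch.nn.Linear(100, 100),
    torch.nn.BatchNorm1d(100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 1)  # final Linear layer, no activation after it
)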

But, I get the following error:

The first parameter to SingleEnvExperiment should be a function that accepts the env and a writer. This is how we achieve dependency injection between the Experiment and the Agent. So you can replace the last few lines with:

import numpy as np

def ppo(env, writer):
    return PPO(
        feature_network,
        value_network,
        policy_network,
        n_envs=1,
        n_steps=10000,
        writer=writer
    )

# create the experiment
experiment = SingleEnvExperiment(ppo, env)

# actually run the experiment
experiment.train(frames=100000)
returns = experiment.test(100)
print('returns: ', np.mean(returns))

However, PPO will work much better if you use multiple envs. The usage will then look like this instead:

n_envs = 10

def ppo(env, writer):
    return PPO(
        feature_network,
        value_network,
        policy_network,
        n_envs=n_envs,
        n_steps=10000,
        writer=writer
    )

# create the experiment
experiment = ParallelEnvExperiment((ppo, n_envs), env)

# actually run the experiment
experiment.train(frames=100000)
returns = experiment.test(100)
print('returns: ', np.mean(returns))

Notice that ParallelEnvExperiment accepts a tuple, (make_agent, n_envs), as its first parameter. This is because n_envs is usually specified as a hyperparameter of the agent, so we need a way to communicate this back to the Experiment object.

If you want more logging information, you have to actually pass writer to each function approximator. To do this, we just need to move the constructors inside of ppo(env, writer):

def ppo(env, writer):
    feature_network = FeatureNetwork(
        feature_model, feature_optimiser, writer=writer
    )
    value_network = VNetwork(
        value_model, value_optimiser, writer=writer
    )
    policy_network = GaussianPolicy(
        policy_model, policy_optimiser, env.action_space, writer=writer
    )
    return PPO(
        feature_network,
        value_network,
        policy_network,
        n_envs=10,
        n_steps=10000,
        writer=writer
    )

Hope that helps!