Hi, I am trying to use the PPO algorithm; however, it's not clear how to construct the stochastic policy. Should I use the Gaussian policy network?
Cool library by the way; I like the modularity!
Hi!
It depends on the action space of your environment. If you have a continuous action space, I do recommend using GaussianPolicy. You're right, the Policy classes could use some additional documentation! GaussianPolicy takes three required parameters: model, optimizer, and space. model should be any PyTorch model that takes an input shape corresponding to the state/observation space, and should output 2 * the number of action dimensions (one output corresponding to the mean and one corresponding to the std for each feature). The module will automatically normalize these outputs to [space.low, space.high]. If you have a discrete action space, you should instead use SoftmaxPolicy. optimizer should be any PyTorch optimizer for the model (e.g., Adam), and space should be the gym.Space corresponding to the action space. See the presets for continuous and Atari environments.
If your environment fits, you might also be able to use the presets directly: from all.presets.continuous import ppo. If not, these files may still be useful as starting points.
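For reference, here is a minimal sketch of constructing a GaussianPolicy on its own, assuming a gym environment env with a Box action space (the import path is from memory and may differ between library versions):
import torch
from all.policies import GaussianPolicy

action_dim = env.action_space.shape[0]

# The model outputs 2 * action_dim values: a mean and a std for each action
# dimension. GaussianPolicy rescales its samples to [space.low, space.high].
# Note: if you pair it with a FeatureNetwork (as discussed below), the input
# size should match the feature dimension rather than the raw observation.
policy_model = torch.nn.Sequential(
    torch.nn.Linear(env.observation_space.shape[0], 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2 * action_dim)
)
policy_optimizer = torch.optim.Adam(policy_model.parameters(), lr=1e-3)
policy = GaussianPolicy(policy_model, policy_optimizer, env.action_space)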
Let me know if you need anything else!
Thanks for the quick response!
Shouldn't the input shape to the GaussianPolicy be the shape of the output from the FeatureNetwork?
If I understand correctly, the arguments to PPO should map in the following way:
feature_network maps observation_dim to feature_dim,
value_network maps feature_dim to 1,
policy_network maps feature_dim to 2*action_dim (stochastic case).
Is that correct?
This is my code:
# instantiate Gym environment
field = Field()
model = Model()
system = AUV(field, model)
env = Deterministic(system)

# feature network
shape = [env.observation_space.shape[0], 400, 100]
feature_model = torch.nn.Sequential(*[
    op for i in range(len(shape) - 1) for op in [
        torch.nn.Linear(shape[i], shape[i+1]),
        torch.nn.BatchNorm1d(shape[i+1]),
        torch.nn.ReLU()
    ]
])
feature_optimiser = torch.optim.Adam(
    feature_model.parameters(),
    lr=1e-3
)
feature_network = FeatureNetwork(
    feature_model, feature_optimiser
)

# value network
shape = [100, 100, 1]
value_model = torch.nn.Sequential(*[
    op for i in range(len(shape) - 1) for op in [
        torch.nn.Linear(shape[i], shape[i+1]),
        torch.nn.BatchNorm1d(shape[i+1]),
        torch.nn.ReLU()
    ]
])
value_optimiser = torch.optim.Adam(
    value_model.parameters(),
    lr=1e-3
)
value_network = VNetwork(
    value_model, value_optimiser
)

# policy network
shape = [100, 100, 2*env.action_space.shape[0]]
policy_model = torch.nn.Sequential(*[
    op for i in range(len(shape) - 1) for op in [
        torch.nn.Linear(shape[i], shape[i+1]),
        torch.nn.BatchNorm1d(shape[i+1]),
        torch.nn.ReLU()
    ]
])
policy_optimiser = torch.optim.Adam(
    policy_model.parameters(),
    lr=1e-3
)
policy_network = GaussianPolicy(
    policy_model, policy_optimiser, env.action_space
)

# proximal policy optimisation agent
agent = PPO(
    feature_network,
    value_network,
    policy_network,
    n_envs=1,
    n_steps=10000
)

# experiment
SingleEnvExperiment(agent, env)
But I get the following error:
Traceback (most recent call last):
  File "train.py", line 91, in <module>
    deterministic()
  File "train.py", line 84, in deterministic
    SingleEnvExperiment(agent, env)
  File "/usr/local/lib/python3.6/dist-packages/all/experiments/single_env_experiment.py", line 16, in __init__
    super().__init__(self._make_writer(agent.__name__, env.name, write_loss), quiet)
  File "/usr/local/lib/python3.6/dist-packages/all/optim/scheduler.py", line 6, in __getattribute__
    value = object.__getattribute__(self, name)
AttributeError: 'PPO' object has no attribute '__name__'
Shouldn't the input shape to the GaussianPolicy be the shape of the output from the FeatureNetwork?
Ah, yes, you're correct. I'm a little skeptical of your network shapes, though: it looks like the last two layers in both the policy and value networks are batchnorm and ReLU layers. Usually, you would want the final layer to be a Linear layer.
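For example, a small helper along these lines (hypothetical, just to illustrate the shape) keeps BatchNorm + ReLU on the hidden layers but finishes with a plain Linear output:
import torch

def mlp(shape):
    # Hidden layers: Linear -> BatchNorm -> ReLU; final layer: plain Linear.
    layers = []
    for i in range(len(shape) - 2):
        layers += [
            torch.nn.Linear(shape[i], shape[i + 1]),
            torch.nn.BatchNorm1d(shape[i + 1]),
            torch.nn.ReLU()
        ]
    layers.append(torch.nn.Linear(shape[-2], shape[-1]))
    return torch.nn.Sequential(*layers)

# e.g. the value head maps features to 1, the policy head to 2 * action_dim
value_model = mlp([100, 100, 1])
policy_model = mlp([100, 100, 2 * env.action_space.shape[0]])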
But I get the following error:
The first parameter to SingleEnvExperiment should be a function that accepts the env and a writer. This is how we achieve dependency injection between the Experiment and the Agent. So you can replace the last few lines with:
def ppo(env, writer):
    return PPO(
        feature_network,
        value_network,
        policy_network,
        n_envs=1,
        n_steps=10000,
        writer=writer
    )

# create the experiment
experiment = SingleEnvExperiment(ppo, env)
# actually run the experiment
experiment.train(frames=100000)
returns = experiment.test(100)
print('returns: ', np.mean(returns))
However, PPO will work much better if you use multiple envs. The usage will then look like:
n_envs = 10

def ppo(env, writer):
    return PPO(
        feature_network,
        value_network,
        policy_network,
        n_envs=n_envs,
        n_steps=10000,
        writer=writer
    )

# create the experiment
experiment = ParallelEnvExperiment((ppo, n_envs), env)
# actually run the experiment
experiment.train(frames=100000)
returns = experiment.test(100)
print('returns: ', np.mean(returns))
Notice that ParallelEnvExperiment accepts the tuple (make_agent, n_envs) as the first parameter. This is because n_envs is usually specified as a hyperparameter of the agent, so we need a way to communicate this back to the Experiment object.
If you want more logging information, you have to actually pass writer to each function approximator. To do this, we just need to move the constructors inside of ppo(env, writer):
def ppo(env, writer):
    feature_network = FeatureNetwork(
        feature_model, feature_optimiser, writer=writer
    )
    value_network = VNetwork(
        value_model, value_optimiser, writer=writer
    )
    policy_network = GaussianPolicy(
        policy_model, policy_optimiser, env.action_space, writer=writer
    )
    return PPO(
        feature_network,
        value_network,
        policy_network,
        n_envs=10,
        n_steps=10000,
        writer=writer
    )
Hope that helps!