mclearning2 closed this 5 years ago
Good job. I am gonna review this tonight.
Good. Would you assign reviewers and an assignee? And please check the wandb link; I can't view it.
@MrSyee Thanks, I did it.
Is it trained?
@Curt-Park, surprisingly, yes. I think using dist.mean for the actions affects the test performance.
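For reference, a minimal sketch of the difference between sampled and deterministic actions at test time (the actor and state names below are placeholders, not necessarily the notebook's exact code):

# training: sample from the policy distribution for exploration
action, dist = actor(state)               # internally action = dist.sample()

# testing: the deterministic mean action is often more stable
test_action = dist.mean.clamp(-2.0, 2.0)  # keep within Pendulum's action range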
@mclearning2 The ppo_iter method should be moved to a new cell above PPOAgent, and a short note documenting what this method does should be added. Also, since it is not a member method of the PPOAgent class, it should be removed from the summary table.
@MrSyee Thank you for the detailed review. I fixed everything. While fixing the code, I found that my seed now works!! The previous code didn't work because I hadn't set the environment seed. But I'm not sure the others work, so please check by running the whole notebook and watching the graphs and values (in the menu: Kernel > Restart & Run All).
And I'm trying to find optimal parameters
Colab is here.
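For context on the ppo_iter comment above, a minimal sketch of what a standalone ppo_iter mini-batch generator typically looks like in PPO implementations (a hedged illustration; the notebook's exact signature may differ):

from typing import Iterator, Tuple

import numpy as np
import torch


def ppo_iter(
    epoch: int,
    mini_batch_size: int,
    states: torch.Tensor,
    actions: torch.Tensor,
    values: torch.Tensor,
    log_probs: torch.Tensor,
    returns: torch.Tensor,
    advantages: torch.Tensor,
) -> Iterator[Tuple[torch.Tensor, ...]]:
    """Yield random mini-batches of rollout data for several optimization epochs."""
    batch_size = states.size(0)
    for _ in range(epoch):
        for _ in range(batch_size // mini_batch_size):
            idx = np.random.choice(batch_size, mini_batch_size)
            yield (
                states[idx],
                actions[idx],
                values[idx],
                log_probs[idx],
                returns[idx],
                advantages[idx],
            )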
returns: Deque[float] = deque()
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
next_value, self.rewards, self.masks, self.values, self.gamma, self.tau
self._plot(self.total_step, scores, actor_losses, critic_losses)
self.states: list
I was testing this code with modified parameters, under the assumption that a single on-policy agent needs a large number of training steps:
However, Colab doesn't allow long-running executions; the session is terminated during training. I will try this on a weekday at work.
How about taking a look at this repository? It makes a successful result with PPO on Pendulum. You can check the parameters and additional techniques: https://github.com/adik993/ppo-pytorch
Thank you for sharing. @mclearning2 will check that repository and will finish by September 29th, as discussed tonight. We will continue to share progress here.
https://github.com/MrSyee/pg-is-all-you-need/pull/7#issuecomment-533856286
The parameters don't work well, even though the loss decreases.
Please change the file name to "02.PPO.ipynb".
@Curt-Park I ran jupyter-black. However, the Colab line ruler is at 80 characters while black's default limit is 88. Which limit should we use?
@mclearning2 we should follow the default settings of Colab. It's 80.
Thank you for sharing. However, we want to implement an agent that interacts with a single environment, while ppo-pytorch uses ICM and MultiEnv. I'm not sure the parameters and additional techniques will help our work, but I'll consider using them.
I changed the model to be a little like A2C; the only difference is that the standard deviation is constant (1). As a result, it worked well. I'm now searching for optimal parameters.
from typing import Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal


def initialize_uniformly(layer: nn.Linear, init_w: float = 3e-3):
    """Initialize the weights and bias in [-init_w, init_w]."""
    layer.weight.data.uniform_(-init_w, init_w)
    layer.bias.data.uniform_(-init_w, init_w)


class Actor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        """Initialization."""
        super(Actor, self).__init__()
        self.hidden1 = nn.Linear(in_dim, 64)
        self.mu_layer = nn.Linear(64, out_dim)
        self.log_std_layer = nn.Linear(64, out_dim)  # unused below since std is constant
        initialize_uniformly(self.mu_layer)
        initialize_uniformly(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, Normal]:
        """Forward method implementation."""
        x = F.relu(self.hidden1(state))
        mu = torch.tanh(self.mu_layer(x)) * 2  # scale to Pendulum's action range [-2, 2]
        std = torch.ones_like(mu)  # constant standard deviation of 1
        dist = Normal(mu, std)
        action = dist.sample().clamp(-2, 2)
        return action, dist
@mclearning2 we can set the configuration to be the same: process num = 1 and NoCuriosity.factory().
The fastest approach is to run someone else's implementation that performs better.
agent = PPO(MultiEnv('Pendulum-v0', 1, reporter),
reporter=reporter,
normalize_state=True,
normalize_reward=True,
model_factory=MLP.factory(),
curiosity_factory=NoCuriosity.factory(),
reward=GeneralizedRewardEstimation(gamma=0.95, lam=0.15),
advantage=GeneralizedAdvantageEstimation(gamma=0.95, lam=0.15),
learning_rate=4e-4,
clip_range=0.3,
v_clip_range=0.5,
c_entropy=1e-2,
c_value=0.5,
n_mini_batches=32,
n_optimization_epochs=10,
clip_grad_norm=0.5)
Oh, I didn't think of that. I will apply those techniques on Friday. For now I can't, because of a personal matter.
@Curt-Park I ran ppo-pytorch as you recommended. With the parameters they set, the output was similar to mine: sometimes it's good, sometimes it's bad in the test.
I then changed the parameters to values I thought were more appropriate.
agent = PPO(MultiEnv('Pendulum-v0', 1, reporter),
reporter=reporter,
normalize_state=True,
normalize_reward=True,
model_factory=MLP.factory(),
curiosity_factory=NoCuriosity.factory(),
reward=GeneralizedRewardEstimation(gamma=0.95, lam=0.95),
advantage=GeneralizedAdvantageEstimation(gamma=0.95, lam=0.95),
learning_rate=0.001,
clip_range=0.2,
v_clip_range=0.5,
c_entropy=0.005,
c_value=0.5,
n_mini_batches=32,
n_optimization_epochs=10,
clip_grad_norm=0.5)
However, it was worse. I'm trying to find good parameters, but it's difficult, just as it was in my own implementation.
For now, my implementation has high variance, like the result above.
During the experiments I noticed that this variance increases and decreases with the size of the hidden layer: a larger hidden layer increases variance, while a smaller one decreases variance but also lowers the mean score. This may be natural; it needs delicate adjustment to find the optimal tradeoff.
@mclearning2 significant progress. Keep going on. :)
I'm trying to find an optimal PPO setup. I applied clipping to the value or to the model parameters; it either made the performance worse or had no effect.
Constraining the output range of log std could enhance the performance. See mlp_gaussian_policy function.
https://github.com/openai/spinningup/blob/master/spinup/algos/sac/core.py
I tried constraining the output range of log std.
class Actor(nn.Module):
def __init__(
self,
in_dim: int,
out_dim: int,
log_std_min: int = -20,
log_std_max: int = 2
):
"""Initialize."""
super(Actor, self).__init__()
self.log_std_min = log_std_min
self.log_std_max = log_std_max
self.hidden = nn.Linear(in_dim, 128)
self.mu_layer = nn.Linear(128, out_dim)
self.mu_layer = init_layer_uniform(self.mu_layer)
self.log_std_layer = nn.Linear(128, out_dim)
self.log_std_layer = init_layer_uniform(self.log_std_layer)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
mu = torch.tanh(self.mu_layer(x)) * 2
log_std = torch.tanh(self.log_std_layer(x))
log_std = self.log_std_min + 0.5 * (
self.log_std_max - self.log_std_min
) * (log_std + 1)
std = torch.exp(log_std)
dist = Normal(mu, std)
action = dist.sample()
return action, dist
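The two lines that rescale log_std above are just an affine map of tanh's (-1, 1) output onto (log_std_min, log_std_max); a quick sanity check:

log_std_min, log_std_max = -20, 2
for t in (-1.0, 0.0, 1.0):
    print(log_std_min + 0.5 * (log_std_max - log_std_min) * (t + 1))  # -20.0, -9.0, 2.0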
Test Results (10 times):
score: [[-141.70344603]]
score: [[-1107.34639649]]
score: [[-349.53521275]]
score: [[-1036.1588195]]
score: [[-2.22322292]]
score: [[-2.68140979]]
score: [[-5.58217162]]
score: [[-133.22630113]]
score: [[-1019.5034467]]
score: [[-118.67785692]]
I experimented with the same approach.
class Actor(nn.Module):
def __init__(
self,
in_dim: int,
out_dim: int,
log_std_min: int = -20,
log_std_max: int = 2
):
"""Initialization."""
super(Actor, self).__init__()
self.log_std_min = log_std_min
self.log_std_max = log_std_max
self.hidden1 = nn.Linear(in_dim, 32)
self.hidden2 = nn.Linear(32, 32)
self.mu_layer = nn.Linear(32, out_dim)
self.mu_layer = init_layer_uniform(self.mu_layer)
self.log_std_layer = nn.Linear(32, out_dim)
self.log_std_layer = init_layer_uniform(self.log_std_layer)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden1(state))
x = F.relu(self.hidden2(x))
mu = torch.tanh(self.mu_layer(x)) * 2
log_std = torch.tanh(self.log_std_layer(x))
log_std = self.log_std_min + 0.5 * (
self.log_std_max - self.log_std_min
) * (log_std + 1)
std = torch.exp(log_std)
dist = Normal(mu, std)
action = dist.sample()
return action, dist
I guess the training is going well, but the network capacity is not enough. I will try again with a thinner and deeper network. Stay tuned.
A score of -1000 may not be bad. Watching the rendered play, the bar does eventually swing up; it just takes time, and that drags the score down. It means the PPO agent is trained, but not perfectly.
And I tried a thinner and deeper model. It doesn't improve the performance; it only changes how fast it learns.
@mclearning2 It will be much easier to follow your progress if you provide the settings and results you tried. We should share the insights; without sharing, others may spend time on the same trial and error, and that won't help you either.
I pushed the current settings and results. It applies value clipping and removes the Gaussian distribution.
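For reference, "value clipping" here refers to the PPO-style clipped value loss; a minimal sketch (value, old_value, return_, and epsilon are illustrative names, not necessarily the notebook's):

# keep the new value prediction close to the one used during the rollout
clipped_value = old_value + (value - old_value).clamp(-epsilon, epsilon)
value_loss = torch.max(
    (return_ - value).pow(2),
    (return_ - clipped_value).pow(2),
).mean()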
I tried the following settings: higher entropy.
class Actor(nn.Module):
def __init__(
self,
in_dim: int,
out_dim: int,
log_std_min: int = -20,
log_std_max: int = 2
):
"""Initialization."""
super(Actor, self).__init__()
self.log_std_min = log_std_min
self.log_std_max = log_std_max
self.hidden = nn.Linear(in_dim, 128)
self.mu_layer = nn.Linear(128, out_dim)
self.mu_layer = init_layer_uniform(self.mu_layer)
self.log_std_layer = nn.Linear(128, out_dim)
self.log_std_layer = init_layer_uniform(self.log_std_layer)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
mu = torch.tanh(self.mu_layer(x)) * 2
log_std = torch.tanh(self.log_std_layer(x))
log_std = self.log_std_min + 0.5 * (
self.log_std_max - self.log_std_min
) * (log_std + 1)
std = torch.exp(log_std)
dist = Normal(mu, std)
action = dist.sample()
return action, dist
# parameters
num_frames = 150000
agent = PPOAgent(
env,
gamma = 0.96,
tau = 0.9,
batch_size = 64,
epsilon = 0.2,
epoch = 128,
rollout_len = 128,
entropy_weight = 0.01
)
A higher entropy weight has a similar effect to using parallel environments with multi-processing. I tried 0.01 and will try 0.1 as well.
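For reference, the entropy weight enters the actor loss roughly like this (a hedged sketch; surr_loss and clipped_surr_loss are illustrative names for the PPO surrogate terms):

# clipped surrogate objective plus an entropy bonus that encourages exploration
entropy = dist.entropy().mean()
actor_loss = -torch.min(surr_loss, clipped_surr_loss).mean() - entropy_weight * entropy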
Here is the result of 0.1 entropy. The test result is not so bad.
score: [[-130.24481804]]
score: [[-1396.97588997]]
score: [[-1494.55385998]]
score: [[-115.96359802]]
score: [[-4.16239945]]
score: [[-4.42382684]]
score: [[-7.16646373]]
score: [[-129.49037978]]
score: [[-1495.46961907]]
score: [[-1285.18972822]]
score: [[-1302.86093418]]
score: [[-124.52728289]]
score: [[-1004.63473437]]
score: [[-130.21244864]]
score: [[-1454.02280859]]
score: [[-241.05102416]]
score: [[-1367.29442738]]
score: [[-1190.38101372]]
score: [[-961.35308555]]
score: [[-1506.66761708]]
score: [[-1492.17374091]]
score: [[-3.96000771]]
score: [[-8.54655092]]
score: [[-1191.60825215]]
score: [[-1517.73246221]]
score: [[-3.72562266]]
score: [[-1509.36923303]]
score: [[-4.7225913]]
score: [[-1126.90787808]]
score: [[-125.59924837]]
score: [[-129.35825879]]
score: [[-1495.04614324]]
score: [[-1491.02112512]]
score: [[-127.37643363]]
score: [[-1503.07977651]]
score: [[-1494.94831417]]
score: [[-4.44921447]]
score: [[-3.99907054]]
score: [[-1497.35393934]]
score: [[-1493.71208883]]
score: [[-128.47768558]]
score: [[-1402.3079256]]
score: [[-5.67091663]]
score: [[-1060.81727375]]
score: [[-3.77974653]]
score: [[-1194.68542529]]
score: [[-3.911968]]
score: [[-1504.8581121]]
score: [[-130.01183786]]
score: [[-240.59622883]]
score: [[-3.99122673]]
score: [[-1205.37393231]]
score: [[-128.91650906]]
score: [[-976.77554156]]
score: [[-947.14712517]]
score: [[-1217.00858539]]
score: [[-1273.9406424]]
score: [[-4.15917789]]
score: [[-127.34517948]]
score: [[-1323.52895088]]
score: [[-4.19671494]]
score: [[-1512.27547469]]
score: [[-1361.18254198]]
score: [[-130.68846096]]
score: [[-5.53557083]]
score: [[-127.56373857]]
score: [[-1393.94396644]]
score: [[-1520.15532765]]
score: [[-129.61470288]]
score: [[-4.25860515]]
score: [[-5.22567911]]
score: [[-1499.38414266]]
score: [[-1203.76151694]]
score: [[-1281.68922802]]
score: [[-1199.32929212]]
score: [[-1497.37679639]]
score: [[-1452.56743898]]
score: [[-130.49212716]]
score: [[-1509.44556013]]
score: [[-1338.29089836]]
score: [[-1464.95227618]]
score: [[-1494.74179155]]
score: [[-1504.40542729]]
score: [[-1365.64487701]]
score: [[-131.80774829]]
score: [[-3.99753933]]
score: [[-972.16535748]]
score: [[-1519.03465733]]
score: [[-1493.91187715]]
score: [[-1124.78690286]]
score: [[-1494.3810296]]
score: [[-1270.35122268]]
score: [[-1500.52186423]]
score: [[-117.59006587]]
score: [[-1304.48454976]]
score: [[-1210.65874721]]
score: [[-1491.19755902]]
score: [[-122.80648182]]
score: [[-1496.96335039]]
score: [[-124.58070664]]
I will try 0.01 entropy again with longer steps (200,000) because I missed checking the test results due to the session termination on Colab.
A small modification of your recent settings. We should look at the test results, because the agent keeps exploring during training due to entropy maximization.
class Actor(nn.Module):
def __init__(
self,
in_dim: int,
out_dim: int,
log_std_min: int = -20,
log_std_max: int = 2
):
"""Initialization."""
super(Actor, self).__init__()
self.log_std_min = log_std_min
self.log_std_max = log_std_max
self.hidden = nn.Linear(in_dim, 32) # small size hidden layer
self.mu_layer = nn.Linear(32, out_dim)
self.mu_layer = init_layer_uniform(self.mu_layer)
self.log_std_layer = nn.Linear(32, out_dim)
self.log_std_layer = init_layer_uniform(self.log_std_layer)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
mu = torch.tanh(self.mu_layer(x)) * 2
log_std = torch.tanh(self.log_std_layer(x)) # log_std range: (-20, 2)
log_std = self.log_std_min + 0.5 * (
self.log_std_max - self.log_std_min
) * (log_std + 1)
std = torch.exp(log_std)
dist = Normal(mu, std)
action = dist.sample()
return action, dist
# parameters
num_frames = 200000
agent = PPOAgent(
env,
gamma = 0.8,
tau = 0.8,
batch_size = 64,
epsilon = 0.2,
epoch = 64,
rollout_len = 128,
entropy_weight = 0.01 # higher entropy weight for lower variance.
)
score: [[-431.03342891]]
score: [[-1398.13945718]]
score: [[-276.60684344]]
score: [[-277.1123654]]
score: [[-132.24809792]]
score: [[-1492.72885617]]
score: [[-1.64320287]]
score: [[-136.26063434]]
score: [[-2.84782626]]
score: [[-1.78384495]]
score: [[-406.69635635]]
score: [[-569.23581057]]
score: [[-274.53529855]]
score: [[-591.82886159]]
score: [[-542.36173822]]
score: [[-1494.40737499]]
score: [[-433.59872954]]
score: [[-1493.65519691]]
score: [[-1494.25849291]]
score: [[-1499.44006854]]
score: [[-137.01595062]]
score: [[-2.18412965]]
score: [[-580.94835207]]
score: [[-442.05376189]]
score: [[-582.66045183]]
score: [[-134.85339964]]
score: [[-1491.09704244]]
score: [[-132.93166177]]
score: [[-415.54227379]]
score: [[-133.90644976]]
score: [[-434.34562648]]
score: [[-705.74521929]]
score: [[-424.30055892]]
score: [[-135.42478738]]
score: [[-411.75474363]]
score: [[-411.11488032]]
score: [[-596.72162925]]
score: [[-2.61287619]]
score: [[-138.76259517]]
score: [[-743.05337237]]
score: [[-1.73530657]]
score: [[-134.00991072]]
score: [[-424.60862911]]
score: [[-272.65133305]]
score: [[-637.28189951]]
score: [[-719.26566452]]
score: [[-135.17179552]]
score: [[-1.63010459]]
score: [[-276.10223759]]
score: [[-435.17274083]]
score: [[-132.56283901]]
score: [[-408.90098852]]
score: [[-138.52051094]]
score: [[-1491.96905612]]
score: [[-135.96720586]]
score: [[-132.49526984]]
score: [[-555.49813357]]
score: [[-134.33965573]]
score: [[-710.70227035]]
score: [[-406.20367026]]
score: [[-133.65046516]]
score: [[-432.78668079]]
score: [[-1492.46580586]]
score: [[-562.54522204]]
score: [[-268.74393411]]
score: [[-1492.60341698]]
score: [[-140.24355777]]
score: [[-1495.55961206]]
score: [[-269.42414056]]
score: [[-134.7417334]]
score: [[-558.09156085]]
score: [[-271.69486058]]
score: [[-1.95757357]]
score: [[-414.47218197]]
score: [[-135.72825472]]
score: [[-132.86002044]]
score: [[-266.45104296]]
score: [[-137.37845622]]
score: [[-404.61086097]]
score: [[-556.45957152]]
score: [[-414.24664247]]
score: [[-565.97784959]]
score: [[-266.68150046]]
score: [[-5.71920144]]
score: [[-574.94767719]]
score: [[-267.87260196]]
score: [[-776.16529309]]
score: [[-1497.45062946]]
score: [[-1498.74138661]]
score: [[-539.14962369]]
score: [[-134.29569324]]
score: [[-133.91062409]]
score: [[-280.75160116]]
score: [[-1.68397768]]
score: [[-132.79134791]]
score: [[-1497.85922345]]
score: [[-596.4199929]]
score: [[-1493.18102822]]
score: [[-273.53954489]]
score: [[-3.02940193]]
It looks good.
Thank you for your help. I found that the std causes the variance, so I fixed the std at 0.5. As you did, I used a high entropy weight of 0.05. I think that helps too.
class Actor(nn.Module):
def __init__(self, in_dim: int, out_dim: int):
"""Initialization."""
super(Actor, self).__init__()
self.hidden = nn.Linear(in_dim, 28)
self.mu_layer = nn.Linear(28, out_dim)
self.mu_layer = init_layer_uniform(self.mu_layer)
self.log_std_layer = nn.Linear(28, out_dim)
self.log_std_layer = init_layer_uniform(self.log_std_layer)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
mu = torch.tanh(self.mu_layer(x))
std = torch.ones_like(mu) / 2
dist = Normal(mu, std)
action = dist.sample()
return action, dist
class Critic(nn.Module):
def __init__(self, in_dim: int):
"""Initialization."""
super(Critic, self).__init__()
self.hidden = nn.Linear(in_dim, 80)
self.out = nn.Linear(80, 1)
self.out = init_layer_uniform(self.out)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
value = self.out(x)
return value
# parameters
num_frames = 300000
agent = PPOAgent(
env,
gamma = 0.9,
tau = 0.8,
batch_size = 64,
epsilon = 0.2,
epoch = 64,
rollout_len = 128,
entropy_weight = 0.05
)
@mclearning2 This is a tutorial, so we should take a more general and desirable approach that people can take for other problems as well. Fixed std has several drawbacks under the hood.
I recommend using a learnable std model and tuning the upper bound of log_std (i.e., lowering the upper bound).
@Curt-Park, I discussed this with @MrSyee. The conclusion is to try log_std with the bound. If the bound doesn't work, we think there are two choices.
I will update the PR today with the result of the bounded log_std. Thank you for your recommendation.
@mclearning2 It will show similar performance if you set a narrow std range around 0.5. A fixed std is just a subset of std within a range. Cheers.
@Curt-Park Yeah, I think that is a valid opinion. Thanks :)
FYI
The action is normalized to [-1, 1], so we can reduce max_log_std further.
Yeah, it was similar for me. You're right. Thanks!
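For reference, normalizing the action to [-1, 1] is usually handled by a small gym wrapper along these lines (a sketch assuming gym.ActionWrapper is used; the class name is illustrative, not necessarily the notebook's):

import gym
import numpy as np


class ActionNormalizer(gym.ActionWrapper):
    """Rescale agent actions from [-1, 1] to the environment's action range."""

    def action(self, action: np.ndarray) -> np.ndarray:
        low, high = self.action_space.low, self.action_space.high
        scale = (high - low) / 2.0
        center = (high + low) / 2.0
        return np.clip(action * scale + center, low, high)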
class Actor(nn.Module):
def __init__(
self,
in_dim: int,
out_dim: int,
log_std_min: int = -20,
log_std_max: int = 0
):
"""Initialization."""
super(Actor, self).__init__()
self.log_std_min = log_std_min
self.log_std_max = log_std_max
self.hidden = nn.Linear(in_dim, 32) # small size hidden layer
self.mu_layer = nn.Linear(32, out_dim)
self.mu_layer = init_layer_uniform(self.mu_layer)
self.log_std_layer = nn.Linear(32, out_dim)
self.log_std_layer = init_layer_uniform(self.log_std_layer)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
mu = torch.tanh(self.mu_layer(x)) * 2
log_std = torch.tanh(self.log_std_layer(x))
log_std = self.log_std_min + 0.5 * (
self.log_std_max - self.log_std_min
) * (log_std + 1)
std = torch.exp(log_std)
dist = Normal(mu, std)
action = dist.sample()
return action, dist
agent = PPOAgent(
    env,
    gamma = 0.9,
    tau = 0.8,
    batch_size = 64,
    epsilon = 0.2,
    epoch = 64,
    rollout_len = 256,
    entropy_weight = 0.005
)
@mclearning2 It is not safe from the sigmoid saturation issue. If the value goes to 0 during long training, the agent will die.
Add a very small value, e.g. 1e-7.
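That is, something along these lines (a one-line sketch of the suggestion):

std = torch.exp(log_std) + 1e-7  # keep std strictly positive so the Normal never degenerates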
@Curt-Park I changed the model as you suggested and edited it.
I'll tune the optimal parameters and finish this PR by today.
@mclearning2 Good luck
A longer rollout_len helps to reduce variance, but it takes more time.
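For context, tau in these settings is the GAE lambda; a minimal sketch of the kind of GAE computation these notebooks appear to use over each rollout (compute_gae is an illustrative helper name; the exact code may differ):

from typing import List

import torch


def compute_gae(
    next_value: torch.Tensor,
    rewards: List[torch.Tensor],
    masks: List[torch.Tensor],
    values: List[torch.Tensor],
    gamma: float,
    tau: float,
) -> List[torch.Tensor]:
    """Compute GAE-based returns (advantage + value) over one rollout."""
    values = values + [next_value]
    gae = 0
    returns: List[torch.Tensor] = []
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        gae = delta + gamma * tau * masks[step] * gae
        returns.insert(0, gae + values[step])
    return returns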
class Actor(nn.Module):
def __init__(
self,
in_dim: int,
out_dim: int,
log_std_min: int = -20,
log_std_max: int = 0
):
"""Initialization."""
super(Actor, self).__init__()
self.log_std_min = log_std_min
self.log_std_max = log_std_max
self.hidden = nn.Linear(in_dim, 32)
self.mu_layer = nn.Linear(32, out_dim)
self.mu_layer = init_layer_uniform(self.mu_layer)
self.log_std_layer = nn.Linear(32, out_dim)
self.log_std_layer = init_layer_uniform(self.log_std_layer)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
mu = torch.tanh(self.mu_layer(x))
log_std = torch.tanh(self.log_std_layer(x))
log_std = self.log_std_min + 0.5 * (
self.log_std_max - self.log_std_min
) * (log_std + 1)
std = torch.exp(log_std)
dist = Normal(mu, std)
action = dist.sample()
return action, dist
class Critic(nn.Module):
def __init__(self, in_dim: int):
"""Initialization."""
super(Critic, self).__init__()
self.hidden = nn.Linear(in_dim, 64)
self.out = nn.Linear(64, 1)
self.out = init_layer_uniform(self.out)
def forward(self, state: torch.Tensor) -> torch.Tensor:
"""Forward method implementation."""
x = F.relu(self.hidden(state))
value = self.out(x)
return value
agent = PPOAgent(
env,
gamma = 0.9,
tau = 0.8,
batch_size = 64,
epsilon = 0.2,
epoch = 64,
rollout_len = 256,
entropy_weight = 0.005
)
Significant results. PPO is yours now.
Please add a comma after this line: log_std_max: int = 0
Great! I fixed a small typo and approve this PR. Thank you!
I FINALLY MADE IT!! Previously, no matter how long the agent trained, it didn't improve. It turned out that this was because of shape, so I changed the code like this. I think it can be implemented better than this code, but I don't know how, and I was totally worn out by it.
And I figured out that setting the seed needs to happen right below the imports. But in Jupyter, that doesn't work. You can check the Python file and its results.
Here is Colab
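For reference, "setting the seed right below the imports" means something like this (a hedged sketch; the seed value is illustrative):

import random

import gym
import numpy as np
import torch

seed = 777  # illustrative value

env = gym.make("Pendulum-v0")
env.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)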