MrSyee / pg-is-all-you-need

Policy Gradient is all you need! A step-by-step tutorial for well-known PG methods.
MIT License
859 stars 119 forks

Add PPO #7

Closed · mclearning2 closed this 5 years ago

mclearning2 commented 5 years ago

I FINALLY MADE IT!! Previously, no matter how long the agent trained, it didn't improve. It turned out this was because of the shapes. So I changed the code like this:

def step(self, action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Take an action and return the response of the env."""
        next_state, reward, done, _ = self.env.step(action)
        next_state = np.reshape(next_state, (1, -1)).astype(np.float64)   # added
        reward = np.reshape(reward, (1, -1)).astype(np.float64)     # added
        done = np.reshape(done, (1, -1))     # added

        if not self.is_test:
            self.rewards.append(torch.FloatTensor(reward).to(self.device))
            self.masks.append(torch.FloatTensor(1 - done).to(self.device))

        return next_state, reward, done

state = self.env.reset()
state = np.expand_dims(state, axis=0)

I think this can be implemented better than this code, but I don't know how, and I was totally worn out by it.
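
For reference, a minimal sketch of one possible cleaner variant of this method (a sketch only, assuming the same gym-style env and the existing is_test, rewards, masks, and device attributes; the helper name _as_batch is made up):

import numpy as np
import torch
from typing import Tuple


def _as_batch(x) -> np.ndarray:
    """Reshape a scalar or 1-D array into a (1, -1) float64 batch."""
    return np.reshape(x, (1, -1)).astype(np.float64)


def step(self, action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Take an action and return the response of the env with batched shapes."""
    next_state, reward, done, _ = self.env.step(action)
    next_state, reward = _as_batch(next_state), _as_batch(reward)
    done = np.reshape(done, (1, -1))

    if not self.is_test:
        self.rewards.append(torch.FloatTensor(reward).to(self.device))
        self.masks.append(torch.FloatTensor(1 - done).to(self.device))

    return next_state, reward, done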

I also figured out that the seed needs to be set right below the imports. But in Jupyter it doesn't work. You can check the Python file and its result.
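
For illustration, a minimal sketch of seeding right below the imports, assuming gym, NumPy, and PyTorch; the seed value is a placeholder:

import random

import gym
import numpy as np
import torch

seed = 777  # placeholder value

# seed every source of randomness right after the imports
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

# the environment needs its own seed as well (old gym API used in this thread)
env = gym.make("Pendulum-v0")
env.seed(seed)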

Here is Colab

Curt-Park commented 5 years ago

Good job. I am gonna review this tonight.

MrSyee commented 5 years ago

Good. Would you assign reviewers and an assignee? Also, please check the wandb link; I can't open it.

mclearning2 commented 5 years ago

@MrSyee Thanks, I did it.

Curt-Park commented 5 years ago

Is it trained?

Screen Shot 2019-09-18 at 2 38 56 PM

mclearning2 commented 5 years ago

@Curt-Park Surprisingly, yes. I think using dist.mean for the actions affects the performance in the test.
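
For context, a hedged sketch of what selecting dist.mean at test time could look like; the function and its arguments are illustrative, assuming an actor that returns (action, dist):

import numpy as np
import torch


def select_action(
    actor: torch.nn.Module, state: torch.Tensor, is_test: bool
) -> np.ndarray:
    """Sample a stochastic action for training; use the mean for testing."""
    action, dist = actor(state)  # actor returns (sampled action, Normal dist)
    if is_test:
        action = dist.mean       # deterministic action for evaluation
    return action.detach().cpu().numpy()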

MrSyee commented 5 years ago

@mclearning2

  1. In the documentation about PPO, add a brief description of TRPO or a link to TRPO.
  2. Typo:
    • "If the advantage is negative, the objective will decrease, AS decrease. As a result, the action becomes less likely."
  3. In the documentation about GAE, fix:
    • "GAE help ~~~. Please see the paper (<- it is a hyperlink) ~~"
  4. The ppo_iter method should be moved to a new cell above PPOAgent (see the sketch after this list). In addition, write a short description of what this method does. And since it is not a member method of the PPOAgent class, remove it from the summary table.
  5. Fix the docstring of every class: Initialization. -> Initialize.
  6. Just my opinion: total loss can be removed, since actor loss and critic loss already exist.
  7. It is almost a success in the testing phase; however, its performance is not good compared to A2C, DDPG, etc. Let's discuss this problem.
  8. Please add the Colab link.
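
For point 4, a minimal sketch of what a standalone ppo_iter generator with a short docstring might look like; the parameter names are illustrative and not necessarily the ones used in the notebook:

from typing import Iterator, Tuple

import torch


def ppo_iter(
    epoch: int,
    mini_batch_size: int,
    states: torch.Tensor,
    actions: torch.Tensor,
    values: torch.Tensor,
    log_probs: torch.Tensor,
    returns: torch.Tensor,
    advantages: torch.Tensor,
) -> Iterator[Tuple[torch.Tensor, ...]]:
    """Yield random mini-batches from the rollout memory for several epochs."""
    batch_size = states.size(0)
    for _ in range(epoch):
        for _ in range(batch_size // mini_batch_size):
            idx = torch.randint(0, batch_size, (mini_batch_size,))
            yield (
                states[idx],
                actions[idx],
                values[idx],
                log_probs[idx],
                returns[idx],
                advantages[idx],
            )
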
mclearning2 commented 5 years ago

@MrSyee Thank you for the detailed review. I fixed everything. While fixing the code, I found that my seed worked!! The previous code didn't work because I hadn't set the environment seed. But I'm not sure the others work, so please check by just executing and watching the graph or values (in the menu: Kernel > Restart & Run All).

And I'm trying to find optimal parameters.

Colab is here.

image

The graph is the same.

image image

Curt-Park commented 5 years ago

  1. problem with a first-order methods => problem with first-order methods
  2. If the advantage is positive, the objective will increase. As a result, the action becomes more likely. If advantage is negative, the objective will decrease. AS a result, the action becomes less likely. => If the advantage is positive, the objective will increase, so, as a result, the action becomes more likely. On the other hand, if the advantage is negative, the objective will decrease, and the action becomes less likely.
  3. In order to fit the range, the actor outputs the mean value with tanh. The result will be scaled in ActionNormalizer class. => In order to fit the range, the actor outputs the mean value with tanh, and the results will be scaled in ActionNormalizer class.
  4. GAE help to reduce variance while maintaining a proper level of bias. By adjusting parameters 𝜆∈[0,1] and 𝛾∈[0,1] . Please see the paper => GAE helps to reduce variance while maintaining a proper level of bias by adjusting parameters: 𝜆∈[0,1] and 𝛾∈[0,1]. Please see the paper for more details.
  5. Use Deque in typing module. returns: Deque[float] = deque()
  6. The line is too long: delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step] (see the wrapped version in the sketch after this list)
  7. It yield the samples of stacked memory by interacting a environment. => It yields the samples from the memory stacked by interacting the environment.
  8. Too long lines:
    • self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    • next_value, self.rewards, self.masks, self.values, self.gamma, self.tau
    • self._plot(self.total_step, scores, actor_losses, critic_losses)
  9. use List for typing. self.states: list
  10. You don't have to retain the graph unless you repeatedly use the loss.
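
To make points 5 and 6 concrete, a hedged sketch of a GAE computation that uses the Deque typing and wraps the long delta line under 80 characters; variable names are assumptions based on the snippets quoted above:

from collections import deque
from typing import Deque, List

import torch


def compute_gae(
    next_value: torch.Tensor,
    rewards: List[torch.Tensor],
    masks: List[torch.Tensor],
    values: List[torch.Tensor],
    gamma: float,
    tau: float,
) -> List[torch.Tensor]:
    """Compute GAE returns, iterating backwards over the rollout."""
    values = values + [next_value]
    gae = 0
    returns: Deque[torch.Tensor] = deque()
    for step in reversed(range(len(rewards))):
        delta = (
            rewards[step]
            + gamma * values[step + 1] * masks[step]
            - values[step]
        )
        gae = delta + gamma * tau * masks[step] * gae
        returns.appendleft(gae + values[step])
    return list(returns)
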
Curt-Park commented 5 years ago

I was testing this code with modified parameters, under the assumption that a single on-policy agent needs a large number of training steps.

However, Colab doesn't allow long-running executions; the session gets terminated during training. I will try this on weekdays at work.

Curt-Park commented 5 years ago

How about taking a look at this repository? It achieves a successful result with PPO on Pendulum, and you can check its parameters and additional techniques: https://github.com/adik993/ppo-pytorch

MrSyee commented 5 years ago

Thank you for sharing. @mclearning2 will check that repository and finish by September 29th, as discussed tonight. We will continue to share the progress here.

Curt-Park commented 5 years ago

https://github.com/MrSyee/pg-is-all-you-need/pull/7#issuecomment-533856286

The parameters don't work well, even though the loss decreases.

Screen Shot 2019-09-22 at 11 00 45 PM

MrSyee commented 5 years ago

Please change the file name to "02.PPO.ipynb".

mclearning2 commented 5 years ago

@Curt-Park I ran jupyter-black. However, the Colab line ruler is at 80 while Black's limit is 88. Which limit should we use?

Curt-Park commented 5 years ago

@mclearning2 we should follow the default settings of Colab. It's 80.

mclearning2 commented 5 years ago

Thank you for sharing. However, we want to implement an agent that interacts with a single environment, and ppo-pytorch uses ICM and MultiEnv. I'm not sure their parameters and additional techniques will help our work, but I'll consider using them.

I changed the model to be a little like A2C; the only difference is that the standard deviation is constant (1). As a result, it worked great. And I'm searching for optimal parameters.

def initialize_uniformly(layer: nn.Linear, init_w: float = 3e-3):
    """Initialize the weights and bias in [-init_w, init_w]."""
    layer.weight.data.uniform_(-init_w, init_w)
    layer.bias.data.uniform_(-init_w, init_w)

class Actor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        """Initialization."""
        super(Actor, self).__init__()

        self.hidden1 = nn.Linear(in_dim, 64)
        self.mu_layer = nn.Linear(64, out_dim)     
        self.log_std_layer = nn.Linear(64, out_dim)   

        initialize_uniformly(self.mu_layer)
        initialize_uniformly(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden1(state))

        mu = torch.tanh(self.mu_layer(x)) * 2
        std = torch.ones_like(mu)

        dist = Normal(mu, std)
        action = dist.sample().clamp(-2, 2)

        return action, dist

image

Curt-Park commented 5 years ago

@mclearning2 We can set the same configuration: process num = 1 and NoCuriosity.factory().

The fastest approach is to run someone else's implementation that performs better.

    agent = PPO(MultiEnv('Pendulum-v0', 1, reporter),
                reporter=reporter,
                normalize_state=True,
                normalize_reward=True,
                model_factory=MLP.factory(),
                curiosity_factory=NoCuriosity.factory(),
                reward=GeneralizedRewardEstimation(gamma=0.95, lam=0.15),
                advantage=GeneralizedAdvantageEstimation(gamma=0.95, lam=0.15),
                learning_rate=4e-4,
                clip_range=0.3,
                v_clip_range=0.5,
                c_entropy=1e-2,
                c_value=0.5,
                n_mini_batches=32,
                n_optimization_epochs=10,
                clip_grad_norm=0.5)

mclearning2 commented 5 years ago

Oh, I didn't think of that. I will apply those techniques on Friday. For now I can't, due to a personal matter.

mclearning2 commented 5 years ago

@Curt-Park I tried running ppo-pytorch as you recommended. As a result, the parameters they set produce output similar to mine: sometimes it's good, sometimes it's bad in the test.

image

I changed the parameters to what I thought was appropriate.

agent = PPO(MultiEnv('Pendulum-v0', 1, reporter),
                reporter=reporter,
                normalize_state=True,
                normalize_reward=True,
                model_factory=MLP.factory(),
                curiosity_factory=NoCuriosity.factory(),
                reward=GeneralizedRewardEstimation(gamma=0.95, lam=0.95),
                advantage=GeneralizedAdvantageEstimation(gamma=0.95, lam=0.95),
                learning_rate=0.001,
                clip_range=0.2,
                v_clip_range=0.5,
                c_entropy=0.005,
                c_value=0.5,
                n_mini_batches=32,
                n_optimization_epochs=10,
                clip_grad_norm=0.5)

image

However, it was worse. I'm trying to find good parameters, but it's difficult, just as it was in my own implementation.

For now, my implementation has high variance, like the result above.

image

During the experiments I noticed that this variance increases and decreases with the size of the hidden layer. A larger hidden layer increases variance; a smaller one decreases variance but also lowers the mean score. That may be natural; we need careful tuning for the optimal tradeoff.

image

Curt-Park commented 5 years ago

@mclearning2 significant progress. Keep going on. :)

mclearning2 commented 5 years ago

I'm trying to find an optimal PPO. I applied value clipping and clipping of the model parameters; it either made the performance worse or did nothing.
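
For reference, a hedged sketch of the two kinds of clipping mentioned here, PPO-style value clipping plus gradient-norm clipping of the model parameters; names such as v_clip_range and max_grad_norm are illustrative, not the notebook's exact ones:

import torch
import torch.nn.functional as F
from torch import nn
from torch.nn.utils import clip_grad_norm_


def clipped_critic_update(
    critic: nn.Module,
    optimizer: torch.optim.Optimizer,
    state: torch.Tensor,
    old_value: torch.Tensor,
    return_: torch.Tensor,
    v_clip_range: float = 0.5,
    max_grad_norm: float = 0.5,
) -> torch.Tensor:
    """One critic update with value clipping and grad-norm clipping."""
    value = critic(state)
    # keep the new value prediction within v_clip_range of the old one
    value_clipped = old_value + (value - old_value).clamp(
        -v_clip_range, v_clip_range
    )
    critic_loss = torch.max(
        F.mse_loss(value, return_), F.mse_loss(value_clipped, return_)
    )

    optimizer.zero_grad()
    critic_loss.backward()
    clip_grad_norm_(critic.parameters(), max_grad_norm)  # clip parameter grads
    optimizer.step()
    return critic_loss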

Curt-Park commented 5 years ago

Constraining the output range of log std could enhance the performance. See the mlp_gaussian_policy function:

https://github.com/openai/spinningup/blob/master/spinup/algos/sac/core.py

Curt-Park commented 5 years ago

I tried constraining the output range of log std.

class Actor(nn.Module):
    def __init__(
        self, 
        in_dim: int, 
        out_dim: int, 
        log_std_min: int = -20, 
        log_std_max: int = 2
    ):
        """Initialize."""
        super(Actor, self).__init__()

        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.hidden = nn.Linear(in_dim, 128)

        self.mu_layer = nn.Linear(128, out_dim)
        self.mu_layer = init_layer_uniform(self.mu_layer)

        self.log_std_layer = nn.Linear(128, out_dim)
        self.log_std_layer = init_layer_uniform(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))

        mu = torch.tanh(self.mu_layer(x)) * 2
        log_std = torch.tanh(self.log_std_layer(x))
        log_std = self.log_std_min + 0.5 * (
            self.log_std_max - self.log_std_min
        ) * (log_std + 1)
        std = torch.exp(log_std)

        dist = Normal(mu, std)
        action = dist.sample()

        return action, dist

download

Test Results (10 times):

score: [[-141.70344603]]
score: [[-1107.34639649]]
score: [[-349.53521275]]
score: [[-1036.1588195]]
score: [[-2.22322292]]
score: [[-2.68140979]]
score: [[-5.58217162]]
score: [[-133.22630113]]
score: [[-1019.5034467]]
score: [[-118.67785692]]

mclearning2 commented 5 years ago

I ran the same experiment.

image

Curt-Park commented 5 years ago

class Actor(nn.Module):
    def __init__(
        self, 
        in_dim: int, 
        out_dim: int, 
        log_std_min: int = -20, 
        log_std_max: int = 2
    ):
        """Initialization."""
        super(Actor, self).__init__()

        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.hidden1 = nn.Linear(in_dim, 32)
        self.hidden2 = nn.Linear(32, 32)

        self.mu_layer = nn.Linear(32, out_dim)
        self.mu_layer = init_layer_uniform(self.mu_layer)

        self.log_std_layer = nn.Linear(32, out_dim)
        self.log_std_layer = init_layer_uniform(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden1(state))
        x = F.relu(self.hidden2(x))

        mu = torch.tanh(self.mu_layer(x)) * 2
        log_std = torch.tanh(self.log_std_layer(x))
        log_std = self.log_std_min + 0.5 * (
            self.log_std_max - self.log_std_min
        ) * (log_std + 1)
        std = torch.exp(log_std)

        dist = Normal(mu, std)
        action = dist.sample()

        return action, dist

I guess the training goes well, but the network capacity is not enough. I will try again with a thinner and deeper network. Stay tuned.

mclearning2 commented 5 years ago

A score of -1000 may not be bad. Watching the play, the bar does eventually swing up; it just needs time to get there, which pulls the score down. It means the PPO agent is trained, but not perfectly.

I also tried thinner and deeper models. They don't improve the performance; they only change whether learning is faster or slower.

Curt-Park commented 5 years ago

@mclearning2 It would be very helpful for understanding your progress if you provided the settings and results you tried. We should share our insights; without sharing, others may spend time on the same trial and error, and that doesn't help you either.

mclearning2 commented 5 years ago

I pushed the current settings and results. Value clipping is applied and the Gaussian distribution is removed.

image

Curt-Park commented 5 years ago

I tried the following settings: higher entropy.

class Actor(nn.Module):
    def __init__(
        self, 
        in_dim: int, 
        out_dim: int, 
        log_std_min: int = -20, 
        log_std_max: int = 2
    ):
        """Initialization."""
        super(Actor, self).__init__()

        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.hidden = nn.Linear(in_dim, 128)

        self.mu_layer = nn.Linear(128, out_dim)
        self.mu_layer = init_layer_uniform(self.mu_layer)

        self.log_std_layer = nn.Linear(128, out_dim)
        self.log_std_layer = init_layer_uniform(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))

        mu = torch.tanh(self.mu_layer(x)) * 2
        log_std = torch.tanh(self.log_std_layer(x))
        log_std = self.log_std_min + 0.5 * (
            self.log_std_max - self.log_std_min
        ) * (log_std + 1)
        std = torch.exp(log_std)

        dist = Normal(mu, std)
        action = dist.sample()

        return action, dist

# parameters
num_frames = 150000

agent = PPOAgent(
    env,
    gamma = 0.96,
    tau = 0.9,
    batch_size = 64,
    epsilon = 0.2,
    epoch = 128,
    rollout_len = 128,
    entropy_weight = 0.01
)

A higher entropy weight has a similar effect to using parallel environments with multi-processing. I tried 0.01 and will try 0.1 as well.
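
For clarity, a minimal sketch of how the entropy weight usually enters the actor loss, written as a standalone function; the argument names are illustrative:

import torch
from torch.distributions import Normal


def actor_loss_with_entropy(
    dist: Normal,
    action: torch.Tensor,
    old_log_prob: torch.Tensor,
    adv: torch.Tensor,
    epsilon: float = 0.2,
    entropy_weight: float = 0.01,
) -> torch.Tensor:
    """Clipped PPO surrogate loss with an entropy bonus for exploration."""
    log_prob = dist.log_prob(action)
    ratio = (log_prob - old_log_prob).exp()
    surr = ratio * adv
    clipped_surr = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    entropy = dist.entropy().mean()
    # a larger entropy_weight pushes the policy to keep exploring
    return -torch.min(surr, clipped_surr).mean() - entropy_weight * entropy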

0 01_entropy

Curt-Park commented 5 years ago

Here is the result of 0.1 entropy. The test result is not so bad.

score:  [[-130.24481804]]
score:  [[-1396.97588997]]
score:  [[-1494.55385998]]
score:  [[-115.96359802]]
score:  [[-4.16239945]]
score:  [[-4.42382684]]
score:  [[-7.16646373]]
score:  [[-129.49037978]]
score:  [[-1495.46961907]]
score:  [[-1285.18972822]]
score:  [[-1302.86093418]]
score:  [[-124.52728289]]
score:  [[-1004.63473437]]
score:  [[-130.21244864]]
score:  [[-1454.02280859]]
score:  [[-241.05102416]]
score:  [[-1367.29442738]]
score:  [[-1190.38101372]]
score:  [[-961.35308555]]
score:  [[-1506.66761708]]
score:  [[-1492.17374091]]
score:  [[-3.96000771]]
score:  [[-8.54655092]]
score:  [[-1191.60825215]]
score:  [[-1517.73246221]]
score:  [[-3.72562266]]
score:  [[-1509.36923303]]
score:  [[-4.7225913]]
score:  [[-1126.90787808]]
score:  [[-125.59924837]]
score:  [[-129.35825879]]
score:  [[-1495.04614324]]
score:  [[-1491.02112512]]
score:  [[-127.37643363]]
score:  [[-1503.07977651]]
score:  [[-1494.94831417]]
score:  [[-4.44921447]]
score:  [[-3.99907054]]
score:  [[-1497.35393934]]
score:  [[-1493.71208883]]
score:  [[-128.47768558]]
score:  [[-1402.3079256]]
score:  [[-5.67091663]]
score:  [[-1060.81727375]]
score:  [[-3.77974653]]
score:  [[-1194.68542529]]
score:  [[-3.911968]]
score:  [[-1504.8581121]]
score:  [[-130.01183786]]
score:  [[-240.59622883]]
score:  [[-3.99122673]]
score:  [[-1205.37393231]]
score:  [[-128.91650906]]
score:  [[-976.77554156]]
score:  [[-947.14712517]]
score:  [[-1217.00858539]]
score:  [[-1273.9406424]]
score:  [[-4.15917789]]
score:  [[-127.34517948]]
score:  [[-1323.52895088]]
score:  [[-4.19671494]]
score:  [[-1512.27547469]]
score:  [[-1361.18254198]]
score:  [[-130.68846096]]
score:  [[-5.53557083]]
score:  [[-127.56373857]]
score:  [[-1393.94396644]]
score:  [[-1520.15532765]]
score:  [[-129.61470288]]
score:  [[-4.25860515]]
score:  [[-5.22567911]]
score:  [[-1499.38414266]]
score:  [[-1203.76151694]]
score:  [[-1281.68922802]]
score:  [[-1199.32929212]]
score:  [[-1497.37679639]]
score:  [[-1452.56743898]]
score:  [[-130.49212716]]
score:  [[-1509.44556013]]
score:  [[-1338.29089836]]
score:  [[-1464.95227618]]
score:  [[-1494.74179155]]
score:  [[-1504.40542729]]
score:  [[-1365.64487701]]
score:  [[-131.80774829]]
score:  [[-3.99753933]]
score:  [[-972.16535748]]
score:  [[-1519.03465733]]
score:  [[-1493.91187715]]
score:  [[-1124.78690286]]
score:  [[-1494.3810296]]
score:  [[-1270.35122268]]
score:  [[-1500.52186423]]
score:  [[-117.59006587]]
score:  [[-1304.48454976]]
score:  [[-1210.65874721]]
score:  [[-1491.19755902]]
score:  [[-122.80648182]]
score:  [[-1496.96335039]]
score:  [[-124.58070664]]

0 1_entropy

I will try 0.01 entropy again with longer steps (200,000) because I missed checking the test results due to the session termination on Colab.

Curt-Park commented 5 years ago

A small modification to your recent settings. We should look at the test results, because the agent explores during training due to entropy maximization.

Network Architecture

class Actor(nn.Module):
    def __init__(
        self, 
        in_dim: int, 
        out_dim: int, 
        log_std_min: int = -20, 
        log_std_max: int = 2
    ):
        """Initialization."""
        super(Actor, self).__init__()

        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.hidden = nn.Linear(in_dim, 32)  # small size hidden layer

        self.mu_layer = nn.Linear(32, out_dim)
        self.mu_layer = init_layer_uniform(self.mu_layer)

        self.log_std_layer = nn.Linear(32, out_dim)
        self.log_std_layer = init_layer_uniform(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))

        mu = torch.tanh(self.mu_layer(x)) * 2
        log_std = torch.tanh(self.log_std_layer(x))  # log_std range: (-20, 2)
        log_std = self.log_std_min + 0.5 * (
            self.log_std_max - self.log_std_min
        ) * (log_std + 1)
        std = torch.exp(log_std)

        dist = Normal(mu, std)
        action = dist.sample()

        return action, dist

Hyper-Parameters

# parameters
num_frames = 200000

agent = PPOAgent(
    env,
    gamma = 0.8,
    tau = 0.8,
    batch_size = 64,
    epsilon = 0.2,
    epoch = 64,
    rollout_len = 128,
    entropy_weight = 0.01  # higher entropy weight for lower variance.
)

Training Results

0 01_entropy_200000

Test Results

score:  [[-431.03342891]]
score:  [[-1398.13945718]]
score:  [[-276.60684344]]
score:  [[-277.1123654]]
score:  [[-132.24809792]]
score:  [[-1492.72885617]]
score:  [[-1.64320287]]
score:  [[-136.26063434]]
score:  [[-2.84782626]]
score:  [[-1.78384495]]
score:  [[-406.69635635]]
score:  [[-569.23581057]]
score:  [[-274.53529855]]
score:  [[-591.82886159]]
score:  [[-542.36173822]]
score:  [[-1494.40737499]]
score:  [[-433.59872954]]
score:  [[-1493.65519691]]
score:  [[-1494.25849291]]
score:  [[-1499.44006854]]
score:  [[-137.01595062]]
score:  [[-2.18412965]]
score:  [[-580.94835207]]
score:  [[-442.05376189]]
score:  [[-582.66045183]]
score:  [[-134.85339964]]
score:  [[-1491.09704244]]
score:  [[-132.93166177]]
score:  [[-415.54227379]]
score:  [[-133.90644976]]
score:  [[-434.34562648]]
score:  [[-705.74521929]]
score:  [[-424.30055892]]
score:  [[-135.42478738]]
score:  [[-411.75474363]]
score:  [[-411.11488032]]
score:  [[-596.72162925]]
score:  [[-2.61287619]]
score:  [[-138.76259517]]
score:  [[-743.05337237]]
score:  [[-1.73530657]]
score:  [[-134.00991072]]
score:  [[-424.60862911]]
score:  [[-272.65133305]]
score:  [[-637.28189951]]
score:  [[-719.26566452]]
score:  [[-135.17179552]]
score:  [[-1.63010459]]
score:  [[-276.10223759]]
score:  [[-435.17274083]]
score:  [[-132.56283901]]
score:  [[-408.90098852]]
score:  [[-138.52051094]]
score:  [[-1491.96905612]]
score:  [[-135.96720586]]
score:  [[-132.49526984]]
score:  [[-555.49813357]]
score:  [[-134.33965573]]
score:  [[-710.70227035]]
score:  [[-406.20367026]]
score:  [[-133.65046516]]
score:  [[-432.78668079]]
score:  [[-1492.46580586]]
score:  [[-562.54522204]]
score:  [[-268.74393411]]
score:  [[-1492.60341698]]
score:  [[-140.24355777]]
score:  [[-1495.55961206]]
score:  [[-269.42414056]]
score:  [[-134.7417334]]
score:  [[-558.09156085]]
score:  [[-271.69486058]]
score:  [[-1.95757357]]
score:  [[-414.47218197]]
score:  [[-135.72825472]]
score:  [[-132.86002044]]
score:  [[-266.45104296]]
score:  [[-137.37845622]]
score:  [[-404.61086097]]
score:  [[-556.45957152]]
score:  [[-414.24664247]]
score:  [[-565.97784959]]
score:  [[-266.68150046]]
score:  [[-5.71920144]]
score:  [[-574.94767719]]
score:  [[-267.87260196]]
score:  [[-776.16529309]]
score:  [[-1497.45062946]]
score:  [[-1498.74138661]]
score:  [[-539.14962369]]
score:  [[-134.29569324]]
score:  [[-133.91062409]]
score:  [[-280.75160116]]
score:  [[-1.68397768]]
score:  [[-132.79134791]]
score:  [[-1497.85922345]]
score:  [[-596.4199929]]
score:  [[-1493.18102822]]
score:  [[-273.53954489]]
score:  [[-3.02940193]]

It looks good.

mclearning2 commented 5 years ago

Thank you for your help. I found that the std drives the variance, so I fixed the std at 0.5. As you did, I used a high entropy weight of 0.05. I think that helps too.

class Actor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        """Initialization."""
        super(Actor, self).__init__()

        self.hidden = nn.Linear(in_dim, 28)

        self.mu_layer = nn.Linear(28, out_dim)
        self.mu_layer = init_layer_uniform(self.mu_layer)

        self.log_std_layer = nn.Linear(28, out_dim)
        self.log_std_layer = init_layer_uniform(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))

        mu = torch.tanh(self.mu_layer(x))
        std = torch.ones_like(mu) / 2

        dist = Normal(mu, std)
        action = dist.sample()

        return action, dist

class Critic(nn.Module):
    def __init__(self, in_dim: int):
        """Initialization."""
        super(Critic, self).__init__()

        self.hidden = nn.Linear(in_dim, 80)
        self.out = nn.Linear(80, 1)
        self.out = init_layer_uniform(self.out)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))
        value = self.out(x)

        return value

# parameters
num_frames = 300000

agent = PPOAgent(
    env,
    gamma = 0.9,
    tau = 0.8,
    batch_size = 64,
    epsilon = 0.2,
    epoch = 64,
    rollout_len = 128,
    entropy_weight = 0.05
)

image

Curt-Park commented 5 years ago

@mclearning2 This is a tutorial, so we should take a more general and desirable approach that people can also apply to other problems. A fixed std has several drawbacks under the hood:

  1. Entropy maximization doesn't work well with it, because it prevents the policy from becoming a long-tailed distribution.
  2. If the environment is more complicated, it possibly won't work well due to the above issue (exploitation only).
  3. The learnable-std model may look like it performs worse simply because it maximizes entropy and keeps exploring during training. We should use the test results to estimate the performance.

I recommend using a learnable std model and tuning the upper bound of log_std (lowering the upper bound).

mclearning2 commented 5 years ago

@Curt-Park I discussed with @MrSyee. The conclusion is to try log_std with the bound. If the bound doesn't work, we think there are two choices:

  1. Multiple environments
  2. A fixed std that can be selected.

I will update the PR today with the results of the bounded log_std. Thank you for your recommendation.

Curt-Park commented 5 years ago

@mclearning2 It will show similar performance if you set a narrow std range around 0.5. A fixed std is just a subset of a std within a range. Cheers.

mclearning2 commented 5 years ago

@Curt-Park Yeah, I think that is a valid opinion. Thanks :)

Curt-Park commented 5 years ago

FYI

Hidden layer size = 32, max_log_std = 0.5, min_log_std = -20

0 5

Hidden layer size = 32, max_log_std = 0.0, min_log_std = -20

0

The action is normalized within [-1, 1], so we can reduce max_log_std further.
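
For reference, a hedged sketch of an ActionNormalizer-style gym wrapper that maps actions from [-1, 1] into the env's action range; this is a generic sketch for the old gym API, not necessarily the exact class in the notebook:

import gym
import numpy as np


class ActionNormalizer(gym.ActionWrapper):
    """Rescale actions from [-1, 1] to the environment's action range."""

    def action(self, action: np.ndarray) -> np.ndarray:
        low = self.action_space.low
        high = self.action_space.high
        scale = (high - low) / 2.0
        center = (high + low) / 2.0
        return np.clip(action * scale + center, low, high)

    def reverse_action(self, action: np.ndarray) -> np.ndarray:
        low = self.action_space.low
        high = self.action_space.high
        scale = (high - low) / 2.0
        center = (high + low) / 2.0
        return np.clip((action - center) / scale, -1.0, 1.0)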

mclearning2 commented 5 years ago

Yeah, it was similar for me. You're right. Thanks!

class Actor(nn.Module):
    def __init__(
        self, 
        in_dim: int, 
        out_dim: int, 
        log_std_min: int = -20, 
        log_std_max: int = 0
    ):
        """Initialization."""
        super(Actor, self).__init__()

        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.hidden = nn.Linear(in_dim, 32)  # small size hidden layer

        self.mu_layer = nn.Linear(32, out_dim)
        self.mu_layer = init_layer_uniform(self.mu_layer)

        self.log_std_layer = nn.Linear(32, out_dim)
        self.log_std_layer = init_layer_uniform(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))

        mu = torch.tanh(self.mu_layer(x)) * 2
        log_std = torch.tanh(self.log_std_layer(x))
        log_std = self.log_std_min + 0.5 * (
            self.log_std_max - self.log_std_min
        ) * (log_std + 1)
        std = torch.exp(log_std)

        dist = Normal(mu, std)
        action = dist.sample()

        return action, dist

agent = PPOAgent(
    env,
    gamma = 0.9,
    tau = 0.8,
    batch_size = 64,
    epsilon = 0.2,
    epoch = 64,
    rollout_len = 256,
    entropy_weight = 0.005
)

image

Curt-Park commented 5 years ago

@mclearning2 It is not safe from the sigmoid saturation issue. If the value goes to 0 during long training, the agent will die.

Add a very small value, e.g. 1e-7.
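
For illustration, a minimal sketch of one place the small constant could go; the helper is made up and the exact placement in the notebook may differ:

import torch


def bounded_std(log_std: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Exponentiate log_std and add a tiny constant so std never reaches 0."""
    return torch.exp(log_std) + eps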

mclearning2 commented 5 years ago

@Curt-Park I changed the model to match yours and edited it.

mclearning2 commented 5 years ago

I'll tune the parameters and finish this PR today.

Curt-Park commented 5 years ago

@mclearning2 Good luck

mclearning2 commented 5 years ago

A larger rollout_len helps reduce the variance, but it takes longer.

Model

class Actor(nn.Module):
    def __init__(
        self, 
        in_dim: int, 
        out_dim: int, 
        log_std_min: int = -20,
        log_std_max: int = 0
    ):
        """Initialization."""
        super(Actor, self).__init__()

        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.hidden = nn.Linear(in_dim, 32)

        self.mu_layer = nn.Linear(32, out_dim)
        self.mu_layer = init_layer_uniform(self.mu_layer)

        self.log_std_layer = nn.Linear(32, out_dim)
        self.log_std_layer = init_layer_uniform(self.log_std_layer)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))

        mu = torch.tanh(self.mu_layer(x))
        log_std = torch.tanh(self.log_std_layer(x))
        log_std = self.log_std_min + 0.5 * (
            self.log_std_max - self.log_std_min
        ) * (log_std + 1)
        std = torch.exp(log_std)

        dist = Normal(mu, std)
        action = dist.sample()

        return action, dist

class Critic(nn.Module):
    def __init__(self, in_dim: int):
        """Initialization."""
        super(Critic, self).__init__()

        self.hidden = nn.Linear(in_dim, 64)
        self.out = nn.Linear(64, 1)
        self.out = init_layer_uniform(self.out)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward method implementation."""
        x = F.relu(self.hidden(state))
        value = self.out(x)

        return value

Parameters

agent = PPOAgent(
    env,
    gamma = 0.9,
    tau = 0.8,
    batch_size = 64,
    epsilon = 0.2,
    epoch = 64,
    rollout_len = 256,
    entropy_weight = 0.005
)

rollout_len = 256

image

rollout_len = 1024

image

rollout_len = 2048

image

Curt-Park commented 5 years ago

Significant results. PPO is yours now.

Curt-Park commented 5 years ago

Please add a comma after this line: log_std_max: int = 0

MrSyee commented 5 years ago

Great! I fixed a small typo and approved this PR. Thank you!