Implementation of PPO loss

In your implementation of the PPO loss, do you not need to collapse both prob and old_prob down to a single scalar per row, instead of a vector with a single non-zero entry? Otherwise, it seems that you flood the loss with negative numbers if the advantage is negative.

See below for a walkthrough of what I think is happening - I've split the implementation into a few extra variables and removed entropy for ease of explanation

def proximal_policy_optimization_loss(advantage, old_prediction):
    def loss(y_true, y_pred):
        prob = y_true * y_pred
        old_prob = y_true * old_prediction
        r = prob/(old_prob + 1e-10)
        clipped = K.clip(r, min_value=0.8, max_value=1.2)
        ppo_loss = K.minimum(r * advantage, clipped * advantage)

        return -K.mean(ppo_loss)

Let's say we have this input

y_true = [0, 1, 0]
y_pred = [0.3, 0.5, 0.2]
old_prediction = [0.3, 0.4, 0.3]
advantage = -1

Then I calculate the following:

prob = [0, 0.5, 0]
old_prob = [0, 0.4, 0]
r = [0, 1.25, 0]
clipped = [0.8, 1.2, 0.8]
ppo_loss = [-0.8, -1.25, -0.8]
return -0.9333

However, my understand of PPO loss is that only the chosen action should be used in the calculation - that is, we should collapse the vector early, so that we're working with a single scalar per row:

prob = K.sum(y_true * y_pred, axis = 1)
old_prob = K.sum(y_true * old_prediction, axis = 1)

This way, we get the following:

prob = 0.5
old_prob = 0.4
r = 1.25
clipped = 1.2
ppo_loss = -1.25
return -1.25

Is this correct? If not, what am I misunderstanding?

Thanks for your help!

David

LuEE-C / PPO-Keras

Implementation of PPO loss #8