abhisheksuran / Reinforcement_Learning

Deep Reinforcement Learning algorithms implemented with TensorFlow 2.3

PPO ratio between old policy and new policy #4

Closed · jbakams closed this issue 3 years ago

jbakams commented 3 years ago

Hi @abhisheksuran, thank you for the amazing implementation. It seems that the ratio between the old probabilities and the new probabilities will always be 1, since they are always the same in the PPO code. I used the following approach to make sure they can actually differ (see the lines commented with many #######). I may be wrong; if so, please enlighten me.

class agent():
    def __init__(self, gamma = 0.99):
        self.gamma = gamma
        # self.a_opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
        # self.c_opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
        self.a_opt = tf.keras.optimizers.Adam(learning_rate=7e-3) ### 3e-3
        self.c_opt = tf.keras.optimizers.Adam(learning_rate=7e-3) ### 3e-3
        self.actor = actor()
        self.critic = critic()

        self.old_probs = None ####################### this variable will store the old probabilities

        self.clip_pram = 0.2

And then in the learn function:

def learn(self, states, actions, adv, discnt_rewards):  ############# no need for old_probs as a parameter
        discnt_rewards = tf.reshape(discnt_rewards, (len(discnt_rewards),))
        adv = tf.reshape(adv, (len(adv),))

        old_probs = self.old_probs ##################### assign old_probs

        old_p = old_probs ### I didn't see where old_p is used in the following lines

        old_p = tf.reshape(old_p, (len(old_p),2))
        with tf.GradientTape() as tape1, tf.GradientTape() as tape2:
            p = self.actor(states, training=True)
            v =  self.critic(states,training=True)
            v = tf.reshape(v, (len(v),))
            td = tf.math.subtract(discnt_rewards, v)
            c_loss = 0.5 * kls.mean_squared_error(discnt_rewards, v)
            a_loss = self.actor_loss(p, actions, adv, old_probs, c_loss)

        self.old_probs = p.numpy() ################## store the probabilities computed before this gradient step; they become old_probs on the next call

        grads1 = tape1.gradient(a_loss, self.actor.trainable_variables)
        grads2 = tape2.gradient(c_loss, self.critic.trainable_variables)
        self.a_opt.apply_gradients(zip(grads1, self.actor.trainable_variables))
        self.c_opt.apply_gradients(zip(grads2, self.critic.trainable_variables))
        return a_loss, c_loss
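
(For reference, the ratio under discussion is formed inside actor_loss, which is not shown here. Below is a minimal sketch of the standard PPO clipped surrogate as a method of the agent class, assuming the usual form; the repo's actual actor_loss may differ, e.g. by adding an entropy bonus or using closs differently.)

    def actor_loss(self, probs, actions, adv, old_probs, closs):
        # closs is accepted only to match the call above; not used in this sketch.
        # Gather the probability of each taken action under the new and old policies.
        idx = tf.stack([tf.range(len(actions)), tf.cast(actions, tf.int32)], axis=1)
        new_pi = tf.gather_nd(probs, idx)
        old_pi = tf.gather_nd(old_probs, idx)

        # r(theta) = pi_new(a|s) / pi_old(a|s); equals 1 only while the two policies match.
        ratio = new_pi / (old_pi + 1e-10)

        # Clipped surrogate objective (self.clip_pram = 0.2 above).
        surr1 = ratio * adv
        surr2 = tf.clip_by_value(ratio, 1.0 - self.clip_pram, 1.0 + self.clip_pram) * adv
        return -tf.reduce_mean(tf.minimum(surr1, surr2))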

Then, to be sure there is a set of old probabilities for the first run:

  value = agentoo7.critic(np.array([state])).numpy()
  values.append(value[0][0])
  np.reshape(probs, (len(probs),2))
  probs = np.stack(probs, axis=0)

  if s == 0:   ########################## on the first run, set the old probabilities equal to the current probabilities
    agentoo7.old_probs = np.copy(p)

  states, actions,returns, adv  = preprocess1(states, actions, rewards, dones, values, 1)
  for epocs in range(10):
     al, cl = agentoo7.learn(states, actions, adv, returns) ########### no need for probs as a parameter
abhisheksuran commented 3 years ago

Hi, thanks for writing. The ratio between the old and new probabilities is 1 only for the first epoch, and the agent is trained for 10 epochs on each set of experience. The agent would show no learning if the ratio were always 1. I hope that helps. Thanks.
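
For illustration, here is a toy sketch (hypothetical names, not the repo's code) of why the ratio moves away from 1 after the first of the 10 epochs: the old probabilities are frozen when the rollout is collected, while the policy parameters keep changing.

import numpy as np

# Toy illustration: old_probs is fixed at rollout time, while the "policy"
# parameters keep moving, so the ratio equals 1 only on the first epoch.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros(2)                  # stand-in for the actor's parameters
old_probs = softmax(theta)           # probabilities recorded when the rollout was collected

for epoch in range(10):
    new_probs = softmax(theta)       # recomputed from the current parameters
    ratio = new_probs / old_probs    # == [1., 1.] only on the first epoch
    print(epoch, ratio)
    theta += np.array([0.1, -0.1])   # stand-in for a gradient step on the actor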

jbakams commented 3 years ago

@abhisheksuran Yeah, I see what I missed now! Thanks.