The code above performs one update on a sampled minibatch. To find the update frequency, you can go to IQL.backward() in agents/models.py. Under the default setting, once we collect a minibatch of experience (size n_step), we update IQL 10 times. You are right: for a fair comparison, we should use the same update frequency as MA2C/IA2C, that is, once per minibatch. However, that would throw away the advantage of off-policy learning, so I made it 10x. As an alternative, we could keep the same update frequency but make the IQL baseline stronger with DDPG-like updates.
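For concreteness, here is a minimal sketch of that schedule, written in PyTorch rather than the repo's own code: after each rollout of n_step transitions, the Q-net is trained `updates_per_rollout` times, each time on a fresh minibatch sampled from the replay buffer. All names (q_net, replay_buffer, the placeholder environment, the dimensions) are illustrative assumptions, not the actual identifiers in agents/models.py.

```python
import random
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
n_step, batch_size, gamma = 120, 64, 0.99
updates_per_rollout = 10  # the "10x" factor discussed above

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = []  # stores (s, a, r, s_next, done) tuples

def q_update():
    """One gradient step on a minibatch sampled from the replay buffer."""
    batch = random.sample(replay_buffer, batch_size)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Collect a rollout of n_step transitions, then run 10 sampled-minibatch updates.
for rollout in range(100):
    for _ in range(n_step):
        s = torch.randn(obs_dim)                             # placeholder observation
        a = int(torch.randint(n_actions, (1,)))              # placeholder action
        r, s_next, done = 0.0, torch.randn(obs_dim), False   # placeholder env step
        replay_buffer.append((s, a, r, s_next, done))
    if len(replay_buffer) >= batch_size:
        for _ in range(updates_per_rollout):                 # 10 updates per rollout
            q_update()
```

Changing `updates_per_rollout` from 10 to 1 would give the "once per minibatch" schedule used by MA2C/IA2C.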
Hi, thank you for your reply! Do we need to keep the parameters of the target network frozen and update it more slowly than the Q-network? Also, I am still confused about why we can update IQL 10 times at each step. Why choose 10, and is there any difference between updating once and updating 10 times? Thank you!
Q-learning is off-policy, so we can update the Q-net any number of times: each update is based on a minibatch sampled from the replay buffer rather than on the most recently collected one. More updates mean more learning (exploitation) from the experience gathered so far. Only the original DQN is implemented here, so a single Q-net serves as both the target and the behavior policy (and freezing it may hurt exploration efficiency). You would need to modify the code to implement a DDPG-like update, maintaining both a behavior Q-net and a less frequently updated target Q-net, plus additional behavior and target policy-nets if needed.

However, as pointed out in the paper, off-policy learning cannot capture the time-variant transitions, especially under partial observability. This is why on-policy learning like A2C is used. On the other hand, investigating improvements to off-policy learning would be an interesting future direction. For example, n-step TD based Q-learning [1] may be a good starting point, but you may need to extend it to the MARL setting.

[1] Nachum, Ofir, et al. "Bridging the gap between value and policy based reinforcement learning." Advances in Neural Information Processing Systems. 2017.
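To illustrate what a DDPG-like modification could look like, here is a small sketch (again in PyTorch with made-up names, not the repo's code) of a separate target Q-net whose weights track the behavior Q-net through a slow Polyak (soft) update, so the bootstrap target changes more smoothly than the behavior net.

```python
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, tau = 8, 4, 0.99, 0.005

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q_net = copy.deepcopy(q_net)
for p in target_q_net.parameters():
    p.requires_grad_(False)          # the target net is never trained directly
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(s, a, r, s_next, done):
    """TD error where the bootstrap target comes from the slow target network."""
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q, target)

def soft_update():
    """Polyak averaging: target <- tau * behavior + (1 - tau) * target."""
    with torch.no_grad():
        for p_t, p in zip(target_q_net.parameters(), q_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

# One illustrative gradient step on random data, followed by a soft update.
s, s_next = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
a = torch.randint(n_actions, (32,))
r, done = torch.zeros(32), torch.zeros(32)
loss = td_loss(s, a, r, s_next, done)
optimizer.zero_grad()
loss.backward()
optimizer.step()
soft_update()
```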
Thank you for your patience and insights. I may try to extend the method to a DDPG-like algorithm and see how much it improves.
Hi, it seems the target network is updated every step along with the Q-network in the IQL implementation, as shown in policies.py.
Should we try to update the target network only every N (e.g., 1000) steps? Thanks!
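In other words, something like the following hard-update schedule (a hypothetical PyTorch sketch with illustrative names, not the repo's TensorFlow code), where the behavior Q-net is copied into the target Q-net only once every `target_update_interval` steps instead of every step:

```python
import copy
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
target_update_interval = 1000   # the "every N steps" asked about above

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q_net = copy.deepcopy(q_net)

for step in range(10000):
    # ... collect experience and run the usual minibatch Q-update on q_net ...
    if step % target_update_interval == 0:
        target_q_net.load_state_dict(q_net.state_dict())  # hard sync every N steps
```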