dennybritz / cnn-text-classification-tf

Convolutional Neural Network for Text Classification in Tensorflow
Apache License 2.0

How to constrain L2-Norm of weights in the last layer as Kim did? #88

Open Psycho7 opened 7 years ago

Psycho7 commented 7 years ago

I'm new to both NLP and TensorFlow.

I found a way to constrain the gradient norms with the following code:

global_step = tf.Variable(0, name="global_step", trainable=False)
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(cnn.loss)

# Clip each gradient's L2 norm to 3 before applying the update
for i, (g, v) in enumerate(grads_and_vars):
    if g is not None:
        grads_and_vars[i] = (tf.clip_by_norm(g, 3), v)

train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

I was wondering whether I could also modify the weights here, so I tried grads_and_vars[i] = (tf.clip_by_norm(g, 3), tf.clip_by_norm(v, 3)). However, that didn't work :(
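For reference, apply_gradients requires the second element of each (gradient, variable) pair to be the original tf.Variable, so replacing v with a clipped tensor is rejected. A minimal sketch of clipping the weights through a separate assign op instead (W and loss below are placeholder names for illustration, not the repo's):

import tensorflow as tf

# Placeholder variable and loss, only for illustration
W = tf.get_variable("W", shape=[300, 2])
loss = tf.reduce_sum(tf.square(W))

optimizer = tf.train.AdamOptimizer(1e-3)
# Clip gradients only; keep the original variable as the second element
grads_and_vars = [(tf.clip_by_norm(g, 3), v)
                  for g, v in optimizer.compute_gradients(loss)
                  if g is not None]
train_op = optimizer.apply_gradients(grads_and_vars)

# Separate op that rescales W whenever its L2 norm exceeds 3
clip_W_op = W.assign(tf.clip_by_norm(W, 3.0))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)    # one gradient step
    sess.run(clip_W_op)   # then enforce the norm constraint on the weights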

Can I get some help from you?

Psycho7 commented 7 years ago

Here is what I did.

        # Final (unnormalized) scores and predictions
        with tf.name_scope("output"):
            self.output_W = tf.get_variable( 
                "W",
                shape=[num_filters_total, num_classes],
                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(self.output_W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, self.output_W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Training step: run one gradient update, then clip output_W to an L2 norm of 3.0
        def train_step(x_batch, y_batch):
            feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_prob: FLAGS.dropout_prob
            }
            _, step, summaries, loss, accuracy = sess.run(
                [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                feed_dict)
            sess.run(cnn.output_W.assign(tf.clip_by_norm(cnn.output_W, 3.0)))
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
            train_summary_writer.add_summary(summaries, step)

The accuracy becomes 72%, not the 76% you reported in your blog:

I experimented with adding additional L2 penalties for the weights at the last layer and was able to bump up the accuracy to 76%, close to that reported in the original paper

I'm not sure if I did it correctly :(

kn45 commented 7 years ago

It seems that you clip W after all the training steps. Shouldn't the clip be applied at each gradient descent step? Note that this clip operates on W to constrain its norm, not on the gradients to prevent gradient explosion.
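If the goal is to enforce the constraint on every gradient descent step, one option (just a sketch, reusing cnn.output_W, train_op, global_step, and the feed names from the snippets above) is to build the clip op once, under a control dependency on train_op, so a single sess.run applies the gradients and then renormalizes W:

# Sketch: create the clip op once, outside train_step, so new graph nodes are
# not added on every call. Building it under a control dependency on train_op
# makes it run only after the gradient update in the same sess.run.
with tf.control_dependencies([train_op]):
    train_and_clip_op = cnn.output_W.assign(
        tf.clip_by_norm(cnn.output_W, 3.0))

def train_step(x_batch, y_batch):
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_prob: FLAGS.dropout_prob,
    }
    # One call now performs the update and the per-step norm constraint
    _, step = sess.run([train_and_clip_op, global_step], feed_dict)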

Psycho7 commented 7 years ago

The code that clips W is in text_cnn.py, so I think the clipping is performed at every step; at least that's what I want. I might be wrong because I'm not familiar with TF 😢

As for the gradient clipping, someone told me it's a good idea, and I don't think it hurts. Too eager to learn, right?

hkhatod commented 6 years ago

Can someone please shed some light on why the output of l2_loss(b) is added to the same variable as l2_loss(self.output_W)?

l2_loss += tf.nn.l2_loss(self.output_W)
l2_loss += tf.nn.l2_loss(b)

Psycho7 commented 6 years ago

@hkhatod This is L2 regularization: the squared norms of both W and b are accumulated into a single penalty term that is added to the loss. The general idea is to keep the structural risk minimal.
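For context, the accumulated l2_loss is later scaled by l2_reg_lambda and added to the cross-entropy loss, roughly like this sketch using the repo's variable names:

# Both W and b contribute to one scalar l2_loss; scaling it by l2_reg_lambda
# and adding it to the data loss penalizes large weights during training.
with tf.name_scope("loss"):
    losses = tf.nn.softmax_cross_entropy_with_logits(
        logits=self.scores, labels=self.input_y)
    self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss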

csyanbin commented 6 years ago

I am also considering this issue. I think you can refer to the original Theano code: https://github.com/yoonkim/CNN_sentence/blob/master/conv_net_sentence.py#L227

It seems that, except for the first layer, all weights are clipped. But I think clipping the classifier layer is not proper, since it may affect the final probabilities derived from the classifier scores.
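For comparison, Kim's Theano code rescales each column of a weight matrix to have L2 norm at most s = 3 after the gradient step. A rough TensorFlow equivalent of that column-wise constraint (a sketch, reusing output_W from above) would be:

# Kim-style max-norm: rescale each column of W so its L2 norm is at most 3.0,
# mirroring the col_norms rescaling in the Theano code. With axes=[0],
# tf.clip_by_norm computes the norm per column (i.e. per class).
max_norm_op = cnn.output_W.assign(
    tf.clip_by_norm(cnn.output_W, 3.0, axes=[0]))

# e.g. sess.run(max_norm_op) after each training step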

Psycho7 commented 6 years ago

@csyanbin Agreed. Some papers argue that clipping is not elegant; it's more of an empirical trick.