dennybritz / cnn-text-classification-tf

Convolutional Neural Network for Text Classification in Tensorflow
Apache License 2.0

accuracy vs CNN_sentence #9

Closed j314erre closed 8 years ago

j314erre commented 8 years ago

Thank you for this helpful reference implementation in tensorflow.

I'm running your code out-of-the-box but with hyperparameters chosen to match Yoon Kim's theano implementation https://github.com/yoonkim/CNN_sentence

I'm finding accuracy is much lower on the same data set over the same number of mini-batches / epochs...(not using pre-trained word2vec)

I am going to look at model topology, learning rate, dropout, L2, optimizer, loss function, etc. to get to the bottom of this and make sure it is an apples-to-apples comparison, but if you know where to focus efforts any help is appreciated.

[screenshot: dev accuracy comparison plot, 2016-03-22]
dennybritz commented 8 years ago

Hm, which dataset are you comparing? Your own?

A bit of a discrepancy is expected, but not as big as the one in your graph. In my blog post I got accuracy pretty similar to Kim on the movie review dataset.

Is your graph accuracy on the training set? The dev set? Graphing the loss may also be helpful. The first things I would look at are definitely the network hyperparameters (embedding, filter sizes, num filters, dropout, etc.). I'm sure Kim's code has different defaults than mine. I doubt that the optimizer is the problem.

I'd start by graphing the training set accuracy and seeing if you are able to overfit it - if you cannot, your network is probably not "big" enough.

j314erre commented 8 years ago

I am comparing the exact same rt-polarity dataset in both cases with the exact same hyperparameters, so it should be apples-to-apples. The graph is on the dev set. So far I have increased the learning rate, added L2 regularization, and added padding to sentences the way Kim does, all of which have helped push the accuracy closer to the ~0.75 level that Kim's code reaches. I have found the choice of optimizer does make a difference, but unfortunately tensorflow does not support Adadelta, which Kim used, so I can't do a true apples-to-apples comparison there. Working on getting the rest of the way to the 0.78 accuracy that Kim's code shows out-of-the-box...stay tuned.
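(For what it's worth, newer TensorFlow releases do include tf.train.AdadeltaOptimizer, so the optimizer could in principle be matched later. A rough sketch of the swap against train.py's existing training-op block, assuming Kim's Adadelta settings of rho=0.95 and epsilon=1e-6:)

# Hypothetical swap in train.py; tf.train.AdadeltaOptimizer ships with newer TensorFlow.
# learning_rate=1.0 approximates the original parameter-free Adadelta update;
# rho and epsilon follow Kim's CNN_sentence defaults.
optimizer = tf.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95, epsilon=1e-6)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)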

dennybritz commented 8 years ago

Can you also plot the train set? That may give some insight into what's going on.

j314erre commented 8 years ago

Here are accuracy and loss plots for both models.

Basically running Kim's code like this so it does not use pretrained word2vec:

THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -rand

and your code like this to replicate Kim's hard-wired params:

python train.py --embedding_dim 300 --num_filters 100 --batch_size 50 --num_epochs 25 --l2_reg_lambda 0.15

...attempting to run the same-sized model on the same data, apart from the different optimizer noted above...also, Kim's model does not seem to include L2 regularization...

Looks like Kim's model is overfitting, but it is getting 15% higher accuracy on the dev set. I might be missing something obvious here.

[accuracy and loss plots for both models]

j314erre commented 8 years ago

I think I've gotten to the bottom of the performance differences on the rt-polarity dataset, and since there is no inherent problem with this tensorflow implementation, I'll close this issue.

In summary, I was able to replicate the accuracy of CNN_Sentence by making a few simple code changes to cnn-text-classification-tf to mimic its approach to weight initialization, sentence padding, and learning rate.

For future reference, I'll detail my findings below.

In all cases I ran the code with the following commands to equalize the model sizes at all layers:

THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -rand

python train.py --embedding_dim 300 --num_filters 100 --batch_size 50 --num_epochs 25 --l2_reg_lambda 0.15

I've found the difference in dev set accuracy over the same number of epoch steps between cnn-text-classification-tf and CNN_Sentence can be attributed to the implementation differences detailed below: weight initialization, sentence padding, and learning rate.

Here is a graph of dev set accuracy comparing CNN_Sentence to cnn-text-classification-tf run out-of-the-box, plus two versions where I made code changes to cnn-text-classification-tf to address the above differences:

[dev set accuracy comparison: CNN_Sentence vs. cnn-text-classification-tf out-of-the-box and with code changes]

I made code changes to my copy of cnn-text-classification-tf as follows:

+zero_weights

Initialized output weights to 0.0 in text_cnn.py:

        # Final (unnormalized) scores and predictions
        with tf.name_scope("output"):
            #W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
            W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W")
            b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b")
            ...

+initial_padding

Added padding symbols to the beginning of sentences and extended all sentences to length 64 by padding the ends in data_helpers.py:

def pad_sentences(sentences, padding_word="<PAD/>", max_filter=5):
    """
    Pads all sentences to the same length: the longest sentence length plus
    2*(max_filter - 1) padding tokens, mirroring CNN_Sentence's scheme.
    Returns padded sentences.
    """
    pad_filter = max_filter - 1
    sequence_length = max(len(x) for x in sentences)
    sequence_length = sequence_length + 2 * pad_filter
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        # pad_filter tokens on the left, then fill the right up to sequence_length
        num_padding = sequence_length - len(sentence) - pad_filter
        new_sentence = [padding_word] * pad_filter + sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences

+learning_rate

Increased learning rate for optimizer in train.py:

optimizer = tf.train.AdamOptimizer(0.001)

dennybritz commented 8 years ago

Wow, thanks for getting to the bottom of this.

I'm surprised that the 0-initialization works better. Initializing biases to 0 is pretty standard, but initializing W to 0 seems strange to me. Maybe another kind of initialization works just as well, or better.
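For example, Xavier initialization might work just as well there (a sketch, reusing the variable names from text_cnn.py; tf.contrib.layers.xavier_initializer is available in newer TensorFlow versions):

# Possible alternative to the zero init for the output layer (sketch only):
W = tf.get_variable(
    "W",
    shape=[num_filters_total, num_classes],
    initializer=tf.contrib.layers.xavier_initializer())
b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b")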

What's the intuition behind the extra padding?

j314erre commented 8 years ago

I was also surprised that these were the changes that accounted for as much of the difference as they did, and I don't really know why, except that they were in Y. Kim's implementation and I was just trying to understand CNNs for text classification using tensorflow vs theano with an equivalent model. I suspect these changes are specific to this dataset (which is small & noisy) and not indicative of best practice on other datasets, and maybe not even this one! But here are some thoughts...

On the weight initialization: I noticed that the training minibatch cross-entropy loss started out quite high, above ~5.0, and took half of the epochs to get below 1.0, whereas on a binary classification problem you'd expect a random coin-flip classifier to start out around -ln(0.5) ≈ 0.69 and go down from there (which is what Y. Kim's code did), as the loss graphs above show. I traced the discrepancy to the initialization of W and b in the last layer: the initial random values created a very high loss at the starting gate, and the optimizer took a long time to climb down to a minimum. Also, with L2 regularization, the initial L2 loss may have had too much effect in the early stages. Note that Y. Kim used an AdaDelta optimizer not available in tensorflow, so we're comparing to Adam...it looks like Kim's code converges faster but is clearly over-fitting, whereas your original model behaves well over time, and it might be that running it for many more epochs would actually give a more robust and accurate model.
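To make that concrete, a quick back-of-the-envelope check (plain numpy, not from either repo): with the last layer's W and b at zero, both logits are zero, so softmax gives 0.5/0.5 and the very first cross-entropy loss is -ln(0.5) regardless of the label:

import numpy as np

logits = np.zeros(2)                            # zero-initialized output layer => zero logits
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> [0.5, 0.5]
print(-np.log(probs[0]))                        # ~0.6931, the expected starting loss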

On the zero padding: I'm guessing that putting padding at the beginning of the sentences lets the model discriminate words that appear near the beginning of a sentence...it was already padding the ends. This might allow the model to pick out some patterns involving the global placement of words in a sentence. The other idea is that it means more convolution windows per sentence, which might just be a good thing?

Anyway thanks again for this solid implementation of a CNN model for text classification in tensorflow.

dennybritz commented 8 years ago

I see, that makes sense on a high level. For the padding, maybe a wide convolution would get around that and perform just as well.
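Something like this, perhaps (just a sketch, not tested: zero-pad the embedded input inside text_cnn.py so the existing "VALID" convolution effectively becomes a wide one; the max-pool window downstream would have to grow to match the longer conv output):

# Sketch only: pad the sequence dimension by filter_size - 1 on each side,
# so every word appears in every possible window position.
padded = tf.pad(
    self.embedded_chars_expanded,
    [[0, 0], [filter_size - 1, filter_size - 1], [0, 0], [0, 0]])
conv = tf.nn.conv2d(
    padded, W,
    strides=[1, 1, 1, 1],
    padding="VALID",
    name="conv")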

I guess this shows how important preprocessing and initialization are :(

chaitjo commented 8 years ago

Hey @j314erre! Thanks for this. :+1:

I was wondering how to implement the pad_sentences part. At what point in train.py do I apply the function to all my sentences? I'm guessing before it learns the vocabulary.

Also, regarding padding_word, how do I implement it if I am using pre-trained 300-dimensional Google embeddings? I don't think there is a padding symbol in those word vectors. Should I hard-code a condition to make it a zero vector of the same dimension as the word vectors?

j314erre commented 8 years ago

My suggestions above relate to an older version of the code...

Easiest thing is to look at the repository at "Commits on Apr 2, 2016" for train.py, before it got refactored to use VocabularyProcessor. The idea is to swap my pad_sentences example in for the version in data_helpers.py; then all the padding and vocab building takes care of itself in that repository snapshot (or you can at least see how the code worked back then and re-implement something similar on top of the latest version).

For pre-trained word vectors, I would use those to initialize your embedding, but have your embedding continue to learn & adapt word vectors from your training data. Therefore any words that are important in your data set but that didn't happen to be in the pre-trained vocabulary will be taken into account by the embedding. The pad symbol is just one of those words not in the pre-trained set.
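A rough sketch of that idea (hypothetical helper, not part of the repo; here vocabulary maps word -> index and word2vec is a loaded dict of the 300-d Google vectors):

import numpy as np

def build_initial_embeddings(vocabulary, word2vec, embedding_dim=300):
    # Pre-trained vectors where available; small random init for words missing
    # from word2vec; an all-zero row for the padding symbol. The embedding
    # variable itself stays trainable, so all rows keep adapting to the data.
    init = np.random.uniform(-0.25, 0.25, (len(vocabulary), embedding_dim)).astype(np.float32)
    for word, idx in vocabulary.items():
        if word == "<PAD/>":
            init[idx] = np.zeros(embedding_dim, dtype=np.float32)
        elif word in word2vec:
            init[idx] = word2vec[word]
    return init

# Assign it once after variables are initialized, assuming the embedding
# variable in text_cnn.py is exposed as cnn.W:
# sess.run(cnn.W.assign(build_initial_embeddings(vocabulary, word2vec)))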


Arpitha1996 commented 6 years ago

Please help me run this code, i.e., give the procedure for running the code on Ubuntu 14.04.