adeshpande3 / LSTM-Sentiment-Analysis

Sentiment Analysis with LSTMs in Tensorflow

Accuracy for Test Data Varies #15

Open dbl001 opened 6 years ago

dbl001 commented 6 years ago

I'm running:

- TensorFlow version: 1.4.0
- Anaconda Python 3.6
- OS X 10.11.6
- No GPU

I trained the models in my own environment:

```python
iterations = 10
for i in range(iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    print("Accuracy for this batch:",
          (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)
```

Accuracy for this batch: 87.5
Accuracy for this batch: 75.0
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 95.8333313465
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 91.6666686535
Accuracy for this batch: 91.6666686535
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 87.5
Accuracy for this batch: 79.1666686535

Any ideas why the accuracy varies so much from batch to batch? I tried running against the pre-trained model, but TensorFlow 1.4.0 can't process the file. Here's my TensorBoard output:

[TensorBoard screenshot: training accuracy and loss curves]
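Worth noting: every accuracy above is a multiple of 1/24, which suggests a test batch size of 24, so a single misclassified review moves a batch's accuracy by roughly 4.2 percentage points and some batch-to-batch swing is expected. A minimal sketch (reusing the notebook's own getTestBatch, sess, accuracy, input_data, and labels names) of averaging over many batches for a steadier estimate:

```python
# Average accuracy over many test batches instead of printing each one.
# With a batch size of 24, per-batch accuracy moves in steps of ~4.2%,
# so individual batches are inherently noisy.
iterations = 100
total = 0.0
for i in range(iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    total += sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})
print("Mean test accuracy over %d batches: %.2f%%" % (iterations, total / iterations * 100))
```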

adeshpande3 commented 6 years ago

My initial thought is that a big problem with these LSTM/RNN models is combating overfitting to the training data. Judging from your training curves, it's safe to say the network has definitely learned the training data, but it may not be able to generalize to new examples, which would explain the fluctuating test accuracy.

Since this tutorial was mainly meant to expose people to NLP tasks and to using LSTMs/RNNs in Tensorflow, I didn't include these in the code, but a few things should help: adding some type of regularization, trying plain RNNs (since the LSTMs might be contributing to the overfitting), using early stopping, and splitting your data into train/validation/test instead of just train/test so you can see where the validation accuracy drops off. A rough sketch of some of these follows below.

Hope this helps!
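For concreteness, here is a minimal TF 1.x sketch of the dropout-plus-regularization idea. This is not the repo's exact code: the sizes are assumed to match the tutorial's notebook, the GloVe matrix is replaced with a random stand-in, and the L2 penalty on the output layer (with a 0.01 weighting) is the addition.

```python
import numpy as np
import tensorflow as tf

# Assumed sizes, matching the tutorial's notebook.
batchSize, maxSeqLength, numDimensions = 24, 250, 50
lstmUnits, numClasses = 64, 2

input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])
labels = tf.placeholder(tf.float32, [batchSize, numClasses])
wordVectors = np.random.randn(400000, numDimensions).astype(np.float32)  # stand-in for GloVe
data = tf.nn.embedding_lookup(wordVectors, input_data)

# Dropout on the recurrent cell (already present in the tutorial) ...
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)

# ... plus L2 regularization on the output layer (the new part).
weight = tf.get_variable("weight", [lstmUnits, numClasses],
                         regularizer=tf.contrib.layers.l2_regularizer(scale=0.1))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
value = tf.transpose(value, [1, 0, 2])                   # [time, batch, units]
last = tf.gather(value, int(value.get_shape()[0]) - 1)   # final time step
prediction = tf.matmul(last, weight) + bias

# The regularizer drops its penalty into a collection; fold it into the loss.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=prediction, labels=labels)) + 0.01 * tf.add_n(reg_losses)
```

For the train/validation/test suggestion, the idea is simply to carve a validation slice out of the training reviews and stop training when accuracy on that slice stops improving (see the early-stopping sketch further down the thread).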

dbl001 commented 6 years ago

Do you see anything wrong with my simple test (below)?

[Screenshot of the test loop code]

adeshpande3 commented 6 years ago

No, I don't think there's anything wrong with the code itself. I'm just saying that the network has overfit to the training data and thus can't answer new queries with the best accuracy. The fix is tuning hyperparameters, adding regularization, the things I mentioned in the post above, etc.

dbl001 commented 6 years ago

"With four parameters I can fit an elephant and with five I can make him wiggle his trunk." -John von Neumann, cited by Enrico Fermi in Nature 427


dbl001 commented 6 years ago

What about ‘not enough training examples’?

https://venturebeat.com/2017/10/23/google-brain-chief-says-100000-examples-is-enough-data-for-deep-learning/


adeshpande3 commented 6 years ago

ML, and DL especially, is notorious for having a very long list of reasons why a model might not work effectively, and the amount of training data is definitely on that list. As for that particular quote, it has to be taken in the context of your problem space, so I don't think 100,000 should be a hard-and-fast rule or anything.

dbl001 commented 6 years ago

I have a ‘word-sense disambiguation’ question:

Word2vec presumably captures all of a word's senses in its encoding; however, each sense would change the values/distribution of the factors. Does the LSTM, which scans each word, does a word-vector lookup, and factors in the surrounding context, help to pinpoint the word's sense? E.g., this might help with sentences containing sarcasm.


adeshpande3 commented 6 years ago

Hmm, it's tough to tell whether the LSTM units would pick up on that. The more likely case is that when the word vectors are generated by Word2Vec, it inevitably sees many examples where a word such as "flies" is used in the insect sense as well as in the verb sense, given a large enough training corpus. In a way, Word2Vec kind of "averages" the effect of seeing the word in both contexts. Check this thread for more thoughts on that: https://www.reddit.com/r/LanguageTechnology/comments/3jerqt/distinguishing_different_meanings_of_a_word/
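A quick way to see that "averaging" effect, as a hedged sketch (the model path is hypothetical; gensim's KeyedVectors does the lookup):

```python
# A single static vector per surface form means a polysemous word's
# nearest neighbors mix its senses together.
from gensim.models import KeyedVectors

# Hypothetical path to a pretrained word2vec binary.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(kv.most_similar("flies", topn=10))
# Expect neighbors from both the insect sense and the motion/verb sense,
# since both usages pulled on the same single vector during training.
```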

dbl001 commented 6 years ago

https://arxiv.org/pdf/1511.06388.pdf (sense2vec, which addresses this by training separate embeddings for different senses of a word)


dbl001 commented 6 years ago

I tried early stopping after 30,000 and also 50,000 iterations: not much improvement. I tried adjusting the dropout keep probability from 0.75 to 0.5: not much improvement. Next up: regularization, and replacing the LSTM with a plain RNN (a sketch of the cell swap follows below).
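For reference, the cell swap is a one-line change in the TF 1.x contrib API (a sketch with assumed sizes, not the repo's exact code):

```python
import tensorflow as tf

lstmUnits = 64                                    # assumed, matching the tutorial
data = tf.placeholder(tf.float32, [24, 250, 50])  # batch x seq x embedding (assumed)

# LSTM version, as in the tutorial:
#   cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
# Plain-RNN swap:
cell = tf.contrib.rnn.BasicRNNCell(lstmUnits)
cell = tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=0.5)
value, _ = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)
```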


dbl001 commented 6 years ago

LSTM with regularization is the best so far. Have you seen better numbers?

LSTM w/regularization, dropout=0.50, 100,000 iterations:

Accuracy for this batch: 87.5
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 66.6666686535
Accuracy for this batch: 75.0
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 83.3333313465

RNN w/regularization, dropout=0.5, 100,000 iterations:

Accuracy for this batch: 50.0
Accuracy for this batch: 54.1666686535
Accuracy for this batch: 50.0
Accuracy for this batch: 62.5
Accuracy for this batch: 66.6666686535
Accuracy for this batch: 62.5
Accuracy for this batch: 41.6666656733
Accuracy for this batch: 50.0
Accuracy for this batch: 45.8333343267
Accuracy for this batch: 54.1666686535

RNN no regularization, dropout=0.50, 100,000 iterations:

Accuracy for this batch: 70.8333313465
Accuracy for this batch: 50.0
Accuracy for this batch: 58.3333313465
Accuracy for this batch: 41.6666656733
Accuracy for this batch: 54.1666686535
Accuracy for this batch: 66.6666686535
Accuracy for this batch: 66.6666686535
Accuracy for this batch: 54.1666686535
Accuracy for this batch: 54.1666686535
Accuracy for this batch: 58.3333313465


anil215 commented 6 years ago

@dbl001 can you share your code?

dbl001 commented 6 years ago

Hyper-parameters:

- Stopped early: 70,000 iterations
- Dropout: output_keep_prob=0.75
- Regularization: regularizer = tf.contrib.layers.l2_regularizer(scale=0.1), reg_constant = 0.01

Accuracy for this batch: 70.8333313465
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 87.5
Accuracy for this batch: 87.5
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 95.8333313465
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 75.0
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 62.5

What do you think?
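One possible refinement over a fixed 70,000-iteration cut-off is stopping on validation accuracy instead. A hedged sketch: getTrainBatch, optimizer, accuracy, saver, sess, input_data, and labels follow the tutorial's notebook, while getValidationBatch is a hypothetical helper over a held-out validation split.

```python
# Stop when validation accuracy plateaus rather than at a fixed iteration.
best_val, bad_checks, patience = 0.0, 0, 10
for i in range(100000):
    nextBatch, nextBatchLabels = getTrainBatch()
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
    if i % 1000 == 0:
        valBatch, valLabels = getValidationBatch()  # hypothetical helper
        val_acc = sess.run(accuracy, {input_data: valBatch, labels: valLabels})
        if val_acc > best_val:
            best_val, bad_checks = val_acc, 0
            saver.save(sess, "models/best_model.ckpt")  # keep the best weights
        else:
            bad_checks += 1
            if bad_checks >= patience:  # no improvement in 10 checks
                break
```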
