dennybritz / cnn-text-classification-tf

Convolutional Neural Network for Text Classification in Tensorflow
Apache License 2.0

Multiclass classification and pre-trained word embedding (word2vec & GloVe) support and its comparison #69

Open cahya-wirawan opened 7 years ago

cahya-wirawan commented 7 years ago

Hi Denny,

Thanks very much for the great blog and the source code. I just started with machine learning, and this article helped me a lot, especially with text classification. I extended your code to support multiclass classification and pre-trained word embeddings using either word2vec or GloVe. The code is available at: https://github.com/cahya-wirawan/cnn-text-classification-tf

I made a comparison of multiclass classification with and without pre-trained word embeddings, using the 20newsgroup text dataset. I used only 4 of the 20 available topics from the dataset, but it is easy to change; I provide a configuration file to simplify support for multiclass classification, different datasets, and pre-trained word embeddings.

Moreover, I have compared the results of this CNN text classification with naive Bayes and a support vector machine (SVM) using scikit-learn: https://github.com/cahya-wirawan/ML-Collection/blob/master/TextClassification.py

Here is the result:

| Metric | Naive Bayes | SVM | CNN w/o pre-trained word embedding | CNN with word2vec | CNN with GloVe |
|---|---|---|---|---|---|
| Accuracy | 0.83 | 0.91 | 0.88 | 0.95 | 0.95 |
| Training time | < 1s | < 1s | > 6h | > 6h | > 6h |

The accuracy of CNN text classification without pre-trained word embeddings is lower than the SVM's, but still better than the accuracy of NB. Pre-trained word embeddings improve the CNN's accuracy to 0.95, outperforming the SVM. It seems not to matter which pre-trained word embedding is used; their accuracies are nearly the same. The downside of this higher accuracy compared to Bayes or SVM is the training time: the neural network needs several hours or more of training to achieve this accuracy, whereas the SVM needs just a second to train on the same dataset with respectable accuracy.
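For reference, a minimal sketch of what such a bag-of-words baseline could look like with scikit-learn (hypothetical code, not the linked script verbatim; it assumes the same four newsgroups used above):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

# Train and evaluate both baselines on tf-idf features.
for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train.data, train.target)
    print(type(clf).__name__, accuracy_score(test.target, model.predict(test.data)))
```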

fighting41love commented 7 years ago

Thanks for sharing the code!

dinara92 commented 7 years ago

Thanks for sharing! One thing I would like to point out: you mentioned here that it takes about 6 hours to train your CNN implementation. In your code, I found that the number of epochs is set to 200. I guess that's why it might be taking so long? Have you tried a smaller number of epochs, as in Yoon Kim's and Denny Britz's variations of the code (25 and 30 epochs, respectively)? And if you did, how was the accuracy? FYI, I followed Keras' tutorial on CNN classification with pre-trained embeddings, where they also train a model on the 20 newsgroup dataset. After about 15 epochs I could achieve an accuracy of about 89.6%.

stevewyl commented 7 years ago

I changed the embedding file to glove.6B.50d and increased the batch size. However, I still get a GPU out-of-memory (OOM) error. How can I fix this problem on a GTX1060 3GB?

dinara92 commented 7 years ago

Me too, I couldn't load the word2vec Google News corpus; I have the same memory issue on a GTX650 GPU.

cahya-wirawan commented 7 years ago

Instead of increasing the batch size, try reducing it from the default 64 to 32 or smaller until it no longer crashes. If it also crashes during evaluation, try reducing the dev_sample_percentage from 0.1 to 0.05 or lower.

dinara92 commented 7 years ago

I have been decreasing the batch size, even down to 10, but it still seems like bigger word embedding files (like GoogleNews) can't be loaded. I have tried with GloVe 100d, and it works. When I try to run 20 newsgroup, however, with any type of embedding, it gives me this error:

W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 200.0KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[128,1,4,100]

stevewyl commented 7 years ago

@cahya-wirawan Thanks for your hints! I finally ran the code successfully on a GTX1080 8GB with a small batch size and a small dev_sample_percentage. So I'd like to ask: if I want to use pre-trained word embeddings smoothly, what are the device requirements?

stevewyl commented 7 years ago

@cahya-wirawan In the NB and SVM experiments, I found that you used all of the 20newsgroups data. When I tried the same data as you used in the CNN experiment, the SVM classifier got nearly the same results as the CNN model with word embeddings. Maybe CNN with word embeddings can outperform the SVM model on other datasets.

cahya-wirawan commented 7 years ago

Actually, I used only 4 newsgroups (alt.atheism, comp.graphics, sci.med and soc.religion.christian) for my comparison above. I also trained with all 20 newsgroups, and the accuracy of all classifiers decreased, but the ordering was similar (CNN still gets the best accuracy and Bayes the worst).

stevewyl commented 7 years ago

@cahya-wirawan You're right! I tried 6-category classification and found that CNN with word embeddings starts to outperform SVM. Thanks for your immediate reply! : )

skullbone20 commented 7 years ago

@cahya-wirawan Thank you for sharing the code! However, I have a problem using .txt word2vec files. I get the following error:

Load word2vec file data/embeddings/word2vec/GoogleNews-vectors-negative300.txt
Traceback (most recent call last):
  File "train.py", line 168, in <module>
    cfg['word_embeddings']['word2vec']['binary'])
  File "C:\src\Users\JeroenB\cnn-text-classification-tf-master-embeddings\cnn-text-classification-tf-master\data_helpers.py", line 150, in load_embedding_vectors_word2vec
    word, vector = parts[0], list(map('float32', parts[1:]))
TypeError: 'str' object is not callable

Any idea what the issue might be?

queirozfcom commented 7 years ago

Were you guys able to reproduce the results (95% validation accuracy after only 2 epochs) from the tutorial on document classification w/ pretrained word embeddings? Lots of people (myself included) can't seem to reproduce those.

cahya-wirawan commented 7 years ago

I checked my training data for the comparison above again using TensorBoard, and according to the chart below, I reached an accuracy between 0.6 and 0.7 after 2 epochs. I trained on only 4 of the 20 newsgroups with the default batch size of 64 and a data length of 2032, which means 2032/64 ≈ 32 steps/epoch (I hope my calculation is correct :) ). Also, I doubt that we can reach an accuracy of 95% after just 2 epochs; according to the chart, I reached 95% accuracy after 7 epochs.

(screenshot: TensorBoard accuracy chart, 2017-05-01)

aksharma90 commented 7 years ago

@cahya-wirawan How can I get the probability values for each output class? Please help me out with this.

cahya-wirawan commented 7 years ago

@aksharma90 The probability can be calculated from the scores using the softmax function. I added the probability of each output in my latest commit (changes in eval.py). The predictions and their probabilities are saved in the prediction.csv file.
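For anyone who wants to compute this by hand, here is a minimal sketch of turning the scores into probabilities with a numerically stable softmax (assuming `scores` is the [num_examples, num_classes] array produced by the final layer):

```python
import numpy as np

def scores_to_probabilities(scores):
    # Subtracting the per-row maximum before exponentiating avoids
    # overflow and does not change the softmax result.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
```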

CMWENLIU commented 7 years ago

Hi @cahya-wirawan Thank you so much for the multiclass classification functionality you added. I still have issues loading my own local data. Here is what I did:

1. Saved the text files for the 5 categories in subfolders of /data/bbcdata; the bbcdata folder contains 5 folders with the corresponding txt files: "business", "entertainment", "politics", "sport", "tech".
2. Updated the config.yml file as follows:

line 16: default: localdata
line 52: container_path: "/data/bbcdata"
Did I miss something needed to run ./train.py?

Could you help me about that? Thank you so much!

Aven

cahya-wirawan commented 7 years ago

Which issues did you get?

boonkhai commented 7 years ago

Hi @cahya-wirawan , I got the following errors:

Load word2vec file data/GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin
word2vec file has been loaded
2017-06-29 15:47:19.195688: E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_driver.cc:1037] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2017-06-29 15:47:19.196050: E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_timer.cc:54] Internal: error destroying CUDA event in context 000001F766A5DE00: CUDA_ERROR_LAUNCH_FAILED
2017-06-29 15:47:19.196369: E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_timer.cc:59] Internal: error destroying CUDA event in context 000001F766A5DE00: CUDA_ERROR_LAUNCH_FAILED
2017-06-29 15:47:19.196738: F c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_dnn.cc:2478] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED

What are the issues here and what are the possible solutions? Thanks

cahya-wirawan commented 7 years ago

GPU issues in TensorFlow often have something to do with low GPU memory, so try reducing the batch size from the default 64 to 32 or smaller. If there is also a problem during evaluation, try reducing the dev_sample_percentage from 0.1 to 0.05 or lower.

raziehaskari commented 7 years ago

Hi, is there anyone here who works with Persian data using word2vec or Doc2vec?

cahya-wirawan commented 7 years ago

Hi @queirozfcom I checked the code again to see how to reach high accuracy faster. It seems this is possible if we use a higher learning rate. I added a dynamic learning rate that starts high so the model reaches high accuracy faster; the learning rate decays exponentially from 0.003 to 0.0001. With the dynamic learning rate, I can also get an accuracy of 95% after just 2 epochs (or even earlier).
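Roughly, such a schedule can be built with TensorFlow's `tf.train.exponential_decay`; the snippet below is only a sketch, and the `decay_steps`/`decay_rate` values are illustrative assumptions, not necessarily the values committed in the fork:

```python
import tensorflow as tf

global_step = tf.Variable(0, name="global_step", trainable=False)
# Start at 0.003 and multiply by 0.9 every 100 steps; the rate drops
# below 0.0001 after roughly 3200 steps.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.003,
    global_step=global_step,
    decay_steps=100,
    decay_rate=0.9,
    staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
# train_op = optimizer.minimize(loss, global_step=global_step)  # loss defined elsewhere
```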

cahya-wirawan commented 7 years ago

Hi @raziehaskari maybe you can check this link https://sites.google.com/site/rmyeid/projects/polyglot

prateekverma1 commented 7 years ago

@skullbone20 I changed `word, vector = parts[0], list(map('float32', parts[1:]))` to `word, vector = parts[0], list(map(np.float32, parts[1:]))` and that worked for me.
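For context, a simplified sketch of the corrected text-format loader (a hypothetical stand-in for the fork's load_embedding_vectors_word2vec, assuming numpy is imported as np):

```python
import numpy as np

def load_embedding_vectors_word2vec_text(filename, vocabulary):
    # np.float32 is a callable type; the string 'float32' is not,
    # which is what raised the TypeError above.
    embedding_vectors = {}
    with open(filename, "r", encoding="utf-8") as f:
        next(f)  # skip the "vocab_size vector_dim" header line
        for line in f:
            parts = line.rstrip().split(" ")
            word, vector = parts[0], list(map(np.float32, parts[1:]))
            if word in vocabulary:
                embedding_vectors[word] = vector
    return embedding_vectors
```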

wiedersehne commented 7 years ago

Hello, thanks for sharing! I have a problem in my experiment: my training data and testing data are separate files. How can I use the pre-trained word2vec in this setup?

luisfredgs commented 6 years ago

@cahya-wirawan thank you so much for adding the probability calculation 👍.

AritzBi commented 6 years ago

Hello everyone! First of all thank you for sharing!

I was wondering if someone has worked with CNNs with word2vec and SVM+tf-idf for text classification. I've been applying both approaches to a dataset, and both give similar results. In fact, the best f1 (macro) is with SVM+tf-idf. Does this make sense to any of you? Thank you very much!

bhardwaj-gopika commented 6 years ago

I reduced my batch size as well as the dev sample percentage, but I am still getting this error:

Traceback (most recent call last):
  File "./train.py", line 170, in <module>
    cfg['word_embeddings']['word2vec']['binary'])
  File "/home/gopika/Downloads/cnn-text-classification-tf-master/data_helpers.py", line 119, in load_embedding_vectors_word2vec
    with open(filename, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../data/input/word_embeddings/GoogleNews-vectors-negative300.bin'

What is supposed to be in the place of 'filename'?

gigglegrig commented 6 years ago

I think the accuracy mentioned here is the result with the metadata retained. If metadata like the header, footer, and subject is removed, the accuracy is much lower.
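For example, with scikit-learn's loader the metadata can be stripped via the `remove` argument of `fetch_20newsgroups`:

```python
from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers and quoted replies so the classifier cannot
# key on newsgroup metadata instead of the message text.
data = fetch_20newsgroups(subset='train',
                          categories=['alt.atheism', 'comp.graphics',
                                      'sci.med', 'soc.religion.christian'],
                          remove=('headers', 'footers', 'quotes'))
```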

mahsaabazary commented 6 years ago

I have the same problem as bhardwaj-gopika had:

Traceback (most recent call last):
  File "./train.py", line 170, in <module>
    cfg['word_embeddings']['word2vec']['binary'])
  File "/home/gopika/Downloads/cnn-text-classification-tf-master/data_helpers.py", line 119, in load_embedding_vectors_word2vec
    with open(filename, "rb") as f:
IOError: [Errno 2] No such file or directory: '../../data/input/word_embeddings/GoogleNews-vectors-negative300.bin'

What can I do about this?

cahya-wirawan commented 6 years ago

@mahsaabazary @bhardwaj-gopika GoogleNews-vectors-negative300.bin is the pre-trained word2vec from google, it should be downloaded separately from: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
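As a quick sanity check that the download is intact, the file can be loaded with gensim (a sketch; gensim is not otherwise required here):

```python
from gensim.models import KeyedVectors

# Loading the binary GoogleNews vectors needs several GB of RAM.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
print(vectors['computer'][:5])  # first few components of one embedding
```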

jagganaidu commented 6 years ago

Hi Cahya: Thanks for this great article and the GitHub code. You sure are making our lives a lot easier, and we couldn't be more grateful. I have a couple of questions:

  1. How are you dealing with spelling mistakes?
  2. In the evaluation, if we encounter new words (that are not in the training data), the model does not recognize them. Is there a way we can use the pre-trained word embeddings for those words? For example, I see "deer" in training but "tiger" in evaluation, and the model does not recognize that it is an animal after all. Please help if you can. Thanks, Jagan

aadhilr commented 6 years ago

@cahya-wirawan Thanks for your great work. I have the same set of questions that @jagganaidu posted above. Can you please clarify?

antriksh63 commented 6 years ago

@cahya-wirawan @CMWENLIU I have a little trouble understanding the directory structure for multi-class classification. Do the names of the subfolders have to be the names of the classes, or can they be numbered like 1, 2, ...? And do we need to add these folder names to the categories section of the yml file?

martian07 commented 6 years ago

Thanks for sharing!! It helped me a lot.

usmaann commented 5 years ago

> data length of 2032

How did you find the data length (2032)?

working12 commented 5 years ago

> Actually, I used only 4 newsgroups (alt.atheism, comp.graphics, sci.med and soc.religion.christian) for my comparison above. I also trained with all 20 newsgroups, and the accuracy of all classifiers decreased, but the ordering was similar (CNN still gets the best accuracy and Bayes the worst).

@cahya-wirawan What accuracy did you get with all 20 newsgroups, for CNN, SVM, and Bayes? Can you please quote the accuracy percentages?