dennybritz / cnn-text-classification-tf

Convolutional Neural Network for Text Classification in Tensorflow
Apache License 2.0

Using pretrained word embeddings #17

Open sanjaymeena opened 8 years ago

sanjaymeena commented 8 years ago

@dennybritz

Hi, first of all, many thanks for sharing your code. I am trying to use pretrained word embeddings instead of the randomly initialized word embeddings based on the vocabulary size.

My pretrained word embedding is a NumPy array of shape (N, 300) with dtype=float32, where N is the index of the word whose 300-dimensional embedding is stored.

However, I am unable to get past the step
self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x) because TensorFlow only allows int32/int64 as lookup indices.

Can you suggest how I can resolve this issue?

Many thanks

dennybritz commented 8 years ago

Sorry for the late reply, I was traveling. You need to set your pretrained embedding as the initializer of W when W is declared.
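
For example, a minimal sketch (untested; the .npy file name and placeholder shape below are made up, and the array must be float32 with one row per vocabulary word):

    import numpy as np
    import tensorflow as tf

    # Hypothetical pretrained matrix of shape (vocab_size, embedding_size)
    pretrained_embeddings = np.load("my_word2vec_matrix.npy").astype(np.float32)

    input_x = tf.placeholder(tf.int32, [None, 56], name="input_x")  # word indices
    # Use the pretrained matrix as the initial value of W instead of tf.random_uniform
    W = tf.Variable(pretrained_embeddings, name="W")
    embedded_chars = tf.nn.embedding_lookup(W, input_x)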

fmaglia commented 8 years ago

@sanjaymeena Can you share your code for the pre-training of word embeddings?

sanjaymeena commented 8 years ago

@fmaglia I am posting a snippet of code for using pretrained word embeddings:

import tensorflow as tf

from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops as array_ops_
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import nn


class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
      self, sequence_length, num_classes, vocab_size, word_vector_map,
      embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0, embedding_type='static'):

        # Placeholders for input, output and dropout
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

        # Keeping track of l2 regularization loss (optional)
        l2_loss = tf.constant(0.0)

        # Embedding layer
        with tf.device('/cpu:0'), tf.name_scope("embedding"):

            # Use pretrained word2vec embeddings
            if embedding_type == 'static':
                # word_vector_map is a float32 numpy array of shape (vocab_size, embedding_size)
                params = tf.Variable(word_vector_map)
                ids = ops.convert_to_tensor(self.input_x)
                name = "embedding_lookup"

                # Shape of the id tensor: [batch_size, sequence_length]
                shape = array_ops_.shape(ids)

                # Concatenate all the ids from all the sentences into one flat vector
                ids_flat = array_ops_.reshape(ids, math_ops.reduce_prod(shape, keep_dims=True))
                # Look up the embeddings and restore the [batch, sequence, embedding] shape
                embeds_flat = nn.embedding_lookup(params, ids_flat, name=name)
                embed_shape = array_ops_.concat(0, [shape, [-1]])
                embeds = array_ops_.reshape(embeds_flat, embed_shape)
                embeds.set_shape(ids.get_shape().concatenate(params.get_shape()[1:]))

                self.embedded_chars = embeds
                self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

            # Otherwise learn the embedding matrix from scratch
            else:
                W = tf.Variable(
                    tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                    name="W")

                self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
                self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

fmaglia commented 8 years ago

@sanjaymeena Thanks for the code. I have 3 questions:

  1. TypeError: __init__() takes at least 8 arguments (9 given). In train.py I added embedding_type to the class instantiation. How can I solve this error?
  2. In your code there is no step in which word2vec accesses a dataset for the pre-training. Why?
  3. What are you using as the dataset for the pre-training? Can you share it?

ferfervi commented 8 years ago

Hi @dennybritz Thank you for sharing your code! I am having some issues trying to use word2vec as input:

I have a pretrained word2vec matrix "embedding_matrix", built with gensim, with one text sentence per row and n_dim (300) columns. "input_x" in each train step contains [n_batch_reviews, n_dimensions] in word2vec format.

        # Embedding layer
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            W = tf.Variable(tf.random_uniform([embedding_matrix.shape[0], embedding_size], -1.0, 1.0), name="W")
            W.assign(embedding_matrix)

            self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

        # Create a convolution + maxpool layer for each filter size
        #...

    tensorflow.python.framework.errors.InvalidArgumentError: indices[0,10] = -1 is not in [0, 50000)
        [[Node: embedding/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@embedding/W"], validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](embedding/W/read, _recv_input_x_0)]]
    Caused by op 'embedding/embedding_lookup', defined at:
      File "cnn-text-classification-tf/text_cnn.py", line 46, in __init__
        self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)

Fjq commented 8 years ago

@sanjaymeena What does the parameter "word_vector_map" mean, and what is its format?

ngoduyvu commented 8 years ago

Hi guys, I have the same problem as you. I tried to apply word2vec to a convolutional neural network for a text classification task. The pre-training code is "word2vec_basic.py" from the TensorFlow website. I really don't know how to feed it into the network. Here is the code I tried:

 W = tf.Variable(tf.constant(0.0, shape=[vocabulary_size, embedding_size]),
                trainable=False, name="W")

embedding_placeholder = tf.placeholder(tf.float32, [vocabulary_size, embedding_size])
embedding_init = W.assign(embedding_placeholder) 

How could we use the embedding_lookup() function with it?

j314erre commented 8 years ago

Here is how I load pre-trained word2vec into the model...I verified it gives the accuracy boost described in Yoon Kim's paper....YMMV

https://gist.github.com/j314erre/b7c97580a660ead82022625ff7a644d8

In text_cnn.py make W a self variable in TextCNN:


            self.W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                name="W")
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)

In train.py, add something like this after "# Initialize all variables". The basic idea is that at run time, before you start the training steps, you can assign W to whatever you want... in this case, values loaded from a word2vec file.


        sess.run(tf.initialize_all_variables())

        if FLAGS.word2vec:
            # initial matrix with random uniform
            initW = np.random.uniform(-0.25, 0.25, (len(vocab_processor.vocabulary_), FLAGS.embedding_dim))
            # load any vectors from the word2vec binary file
            # (format: a text header "vocab_size layer1_size", then one
            #  "<word> <binary float32 vector>" record per word)
            print("Load word2vec file {}\n".format(FLAGS.word2vec))
            with open(FLAGS.word2vec, "rb") as f:
                header = f.readline()
                vocab_size, layer1_size = map(int, header.split())
                binary_len = np.dtype('float32').itemsize * layer1_size
                for line in xrange(vocab_size):
                    word = []
                    while True:
                        ch = f.read(1)
                        if ch == ' ':
                            word = ''.join(word)
                            break
                        if ch != '\n':
                            word.append(ch)
                    idx = vocab_processor.vocabulary_.get(word)
                    if idx != 0:
                        # word is in our vocabulary: copy its pretrained vector
                        initW[idx] = np.fromstring(f.read(binary_len), dtype='float32')
                    else:
                        # word is not in our vocabulary: skip its vector
                        f.read(binary_len)

            sess.run(cnn.W.assign(initW))

You'd need to match the embedding dimension and the word2vec dimension when you run it like this: --embedding_dim=300 --word2vec=GoogleNews-vectors-negative300.bin

ngoduyvu commented 8 years ago

Hi @j314erre, thanks for your code, but I don't quite understand it. Are all of your results stored in "GoogleNews-vectors-negative300.bin"? My code is similar to "word2vec_basic.py" from TensorFlow. Could I use "cnn.W.assign" like you do?

j314erre commented 8 years ago

My code example assumes pre-trained vectors are stored in the Google format and you can get GoogleNews-vectors-negative300.bin here

I believe for pre-trained vectors it makes more sense to load them from a file.

I haven't used word2vec_basic.py, but it looks like the basic idea would be to assign 'final_embeddings' to the embedding variable via session.run(cnn.W.assign(final_embeddings)).
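
In code that would be roughly the following (untested sketch, assuming final_embeddings from word2vec_basic.py has the same shape as cnn.W, i.e. (vocabulary_size, embedding_dim)):

    # After the graph is built, initialize variables as usual...
    sess.run(tf.initialize_all_variables())

    # ...then overwrite the randomly initialized embedding matrix with the
    # pre-trained vectors; the shapes must match exactly.
    sess.run(cnn.W.assign(final_embeddings))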

ngoduyvu commented 8 years ago

Yeah that is exactly what I did

    W = tf.Variable(tf.constant(0.0, shape=[vocabulary_size, embedding_size]),
                    trainable=False, name="W")

    embedding_placeholder = tf.placeholder(tf.float32, [vocabulary_size, embedding_size])
    embedding_init = W.assign(embedding_placeholder)

    sess = tf.Session()

    sess.run(embedding_init, feed_dict={embedding_placeholder: final_embeddings})

    embedded_chars = tf.nn.embedding_lookup(W, data)
    embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)

I used embedding_lookup with W and data (data is the variable storing the dictionary indices). Is that correct? As @dennybritz wrote, it "maps vocabulary word indices into low-dimensional vector representations".

wsywsya119 commented 8 years ago

Hi @j314erre! Thanks for your code. I followed it and modified train.py, but I got an error with sess.run(cnn.W.assign(initW)). I had set the embedding size to 300.

The error is:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_shape.py", line 566, in merge_with
        new_dims.append(dim.merge_with(other[i]))
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_shape.py", line 133, in merge_with
        self.assert_is_compatible_with(other)
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_shape.py", line 108, in assert_is_compatible_with
        % (self, other))
    ValueError: Dimensions 384 and 21678 are not compatible

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "train.py", line 201, in <module>
        sess.run(cnn.W.assign(initW))
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 497, in assign
        return state_ops.assign(self._variable, value, use_locking=use_locking)
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
        use_locking=use_locking, name=name)
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
        op_def=op_def)
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2262, in create_op
        set_shapes_for_outputs(ret)
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1702, in set_shapes_for_outputs
        shapes = shape_func(op)
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/state_ops.py", line 209, in _AssignShape
        return [op.inputs[0].get_shape().merge_with(op.inputs[1].get_shape())]
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_shape.py", line 570, in merge_with
        (self, other))
    ValueError: Shapes (384, 3) and (21678, 300) are not compatible

It seems initW can't be mapped to self.W? Or is it possible to assign the weights directly in this code? if idx != 0: initW[idx] = np.fromstring(f.read(binary_len), dtype='float32')

j314erre commented 8 years ago

Here is a gist of modified text_cnn.py and train.py that loads pre-trained word2vec https://gist.github.com/j314erre/b7c97580a660ead82022625ff7a644d8

Note: you have to set --embedding_dim to be the same size as your pre-trained embeddings.

Also note: it initializes the embedding with the pre-trained embeddings, but continues to learn and adapt the embeddings as you train further on your data...this is usually what you want to happen.

I ran it like this:

$ ./train.py --embedding_dim=300 --word2vec=./GoogleNews-vectors-negative300.bin

Parameters:
ALLOW_SOFT_PLACEMENT=True
BATCH_SIZE=64
CHECKPOINT_EVERY=100
DROPOUT_KEEP_PROB=0.5
EMBEDDING_DIM=300
EVALUATE_EVERY=100
FILTER_SIZES=3,4,5
L2_REG_LAMBDA=0.0
LOG_DEVICE_PLACEMENT=False
NUM_EPOCHS=200
NUM_FILTERS=128
WORD2VEC=/mnt/hgfs/Temp/deeplearning/data/word2vec/GoogleNews-vectors-negative300.bin

Loading data...
Vocabulary Size: 18758
Train/Dev split: 9662/1000
Writing to /mnt/hgfs/Temp/deeplearning/github/cnn-text-classification-tf-word2vec/runs/1468367719

Load word2vec file /mnt/hgfs/Temp/deeplearning/data/word2vec/GoogleNews-vectors-negative300.bin

2016-07-12T16:56:02.915101: step 1, loss 2.184, acc 0.484375
2016-07-12T16:56:03.463751: step 2, loss 1.05255, acc 0.5625
2016-07-12T16:56:03.786084: step 3, loss 0.961079, acc 0.515625
2016-07-12T16:56:04.096605: step 4, loss 1.14572, acc 0.5
2016-07-12T16:56:04.405554: step 5, loss 1.56523, acc 0.46875
2016-07-12T16:56:04.708008: step 6, loss 1.50058, acc 0.484375
2016-07-12T16:56:05.014675: step 7, loss 1.25756, acc 0.546875
2016-07-12T16:56:05.386628: step 8, loss 1.25465, acc 0.5625
2016-07-12T16:56:05.688880: step 9, loss 1.05234, acc 0.421875
2016-07-12T16:56:06.013079: step 10, loss 0.964042, acc 0.59375
.....
wsywsya119 commented 8 years ago

Hi @j314erre. I applied your code to my fixed version and it works!! Thank you very much!!

chaitjo commented 8 years ago

Hey @j314erre, Thanks for this! What changes should I make to the code if I do not want the embeddings to continue to be trained along with the model? (i.e. I want to use static word2vec embeddings)

j314erre commented 8 years ago

@chaitjo I have not tested this but try trainable=False when you define those parameters with tf.Variable:

        # Embedding layer
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                trainable=False, ### NOT TRAINABLE
                name="W")
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

cuixue commented 7 years ago

@ferfervi Hello, I am facing the same problem as you. Have you solved it?

andrisecker commented 7 years ago

@dennybritz, @j314erre

Hi, thanks for the code! I tried to run it with GloVe trained on the positive and negative examples, and I got an error from tf.nn.embedding_lookup():

    InvalidArgumentError (see above for traceback): indices[0,11] = 21161 is not in [0, 21161)
        [[Node: embedding/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@embedding/W"], validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](embedding/W/read, _recv_input_x_0)]]

21161 is the number of pretrained vectors (and I use a 21161 x embedding_size matrix for the weights in the embedding layer).

In the documentation of tf.nn.embedding_lookup_sparse() they mention: "It also assumes that all id values lie in the range [0, p0), where p0 is the sum of the size of params along dimension 0." Why exactly should the id value be less than the size of params along dimension 0? Is there any way to solve this?
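
The lookup is just a row index into params, so every id has to be a valid row index; an id of 21161 needs a matrix with at least 21162 rows. If the extra index comes from a padding/unknown token that has no pretrained vector, one possible fix is to pad the pretrained matrix with an extra row (untested sketch; variable names here are made up):

    import numpy as np

    # glove_vectors: hypothetical (21161, embedding_size) array of pretrained vectors.
    # Append one zero row so that ids 0..21161 are all valid row indices.
    extra_row = np.zeros((1, glove_vectors.shape[1]), dtype=np.float32)
    padded_vectors = np.concatenate([glove_vectors, extra_row], axis=0)  # (21162, embedding_size)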

hariom-yadaw commented 7 years ago

@dennybritz @j314erre What is the advantage of pretrained word embeddings? I have a dataset (not very large). In the test/validation dataset, I have some sentences with the same meaning/context that do not use exactly the same words, and in that case accuracy is not good. Can a pretrained embedding like GoogleNews-vectors-negative300.bin help with this? How can I solve this issue? Please help; I hope you have lots of ideas on this.

Franck-Dernoncourt commented 7 years ago

An interesting post written by mrry on Using a pre-trained word embedding (word2vec or Glove) in TensorFlow:

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().

  1. Simply create W as a tf.constant() that takes embedding as its value:

    W = tf.constant(embedding, name="W")

    This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

  2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():

    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                    trainable=False, name="W")
    
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    
    # ...
    sess = tf.Session()
    
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

    This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.

  3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:

    W = tf.Variable(...)
    
    embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
    
    # ...
    sess = tf.Session()
    embedding_saver.restore(sess, "checkpoint_filename.ckpt")
hariom-yadaw commented 7 years ago

Thanks a lot @Franck-Dernoncourt! How is a pretrained embedding different from the model's own embedding (created while training the model)? Currently I have a relatively small dataset to train my model on and want it to generalize to new test examples, in the sense that the chatbot understands different sentences carrying the same semantic meaning (intents and entities in them). For example:

  1. How can you help me?; How can you assist me?; What assistance can you provide?; - Intent: help type
  2. Make me laugh.; Crack some jokes please.; Can you make me laugh?; Can you tell me something funny?; Can you entertain me? Please entertain me. ----> intent: Entertainment

What kind of embedding can help me on the above problem? Thanks.

Franck-Dernoncourt commented 7 years ago

@hariom-yadaw Using pre-trained word embeddings, which are fine-tuned during the training phase, sometimes helps; see e.g. https://arxiv.org/pdf/1606.03475.pdf.

hariom-yadaw commented 7 years ago

@Franck-Dernoncourt OK, got it! Thanks. But I still have the two doubts below:

  1. When we use pretrained embeddings (which should cover millions of words), they only work for the words that are in my dataset's vocabulary (a few thousand, say 1K). I am expecting new words (synonyms of words in my vocabulary) in my test set, so it is most likely to fail on those. How do I tackle this?

  2. Do we use pretrained embeddings only when we train the model, or also at test time? (I mean in serving mode, when the chatbot is serving the user.)

pegadoflavian commented 7 years ago

Is anyone facing a memory error while loading the word2vec model? Any suggestions on the amount of memory (RAM) I need to make sure things don't crash? I'm using GoogleNews-vectors-negative300.bin for the word embeddings.

pldelisle commented 7 years ago

I was running the script with Google's pre-trained word vectors on a server with 24 GB of RAM, and pretty much all of the RAM was filled up.
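
If RAM is the bottleneck, one possible workaround (not used in this thread, just a hedged suggestion) is to load only the most frequent vectors with gensim's limit argument instead of reading the whole binary file:

    from gensim.models import KeyedVectors

    # Load only the first 500k of the ~3M Google News vectors; memory use
    # drops roughly proportionally. Adjust `limit` to taste.
    w2v = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, limit=500000)
    vector = w2v["computer"]  # a 300-dimensional numpy array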

Psycho7 commented 7 years ago

For Python 3, you should write it like this:

        if FLAGS.word2vec:
            # initial matrix with random uniform
            initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_), FLAGS.embedding_dim))
            # load any vectors from the word2vec
            print("Load word2vec file {}\n".format(FLAGS.word2vec))
            with open(FLAGS.word2vec, "rb") as f:
                header = f.readline()
                vocab_size, layer1_size = map(int, header.split())
                binary_len = np.dtype('float32').itemsize * layer1_size
                for line in range(vocab_size):
                    print(line)
                    word = []
                    while True:
                        ch = f.read(1).decode('latin-1')
                        if ch == ' ':
                            word = ''.join(word)
                            break
                        if ch != '\n':
                            word.append(ch)
                    print(word)
                    idx = vocab_processor.vocabulary_.get(word)
                    if idx != 0:
                        initW[idx] = np.fromstring(f.read(binary_len), dtype='float32')
                    else:
                        f.read(binary_len)
            sess.run(cnn.W.assign(initW))

And it really cost me a lot of time to figure it out... 😞

peveloper commented 7 years ago

Also running on a cluster and facing a MemoryError when loading the Google pre-trained word vectors.

junchaozheng commented 7 years ago

Using @Psycho7's code, the memory error is solved.

naveenjafer commented 6 years ago

@j314erre The train.py you created in your gist seems to use the load_data_and_labels function from data_helpers, but you have not passed the positive_data_file and negative_data_file flag paths as is done in the original train.py. Why?

naveenjafer commented 6 years ago

Anyone looking to use multi-class classification with the latest versions of TensorFlow and pre-trained word2vec, feel free to use this consolidated fork https://github.com/naveenjafer/cnn-text-classification-tf, which builds on all the work done by the people here. Be sure to read the README carefully before proceeding.

hkhatod commented 6 years ago

Has anyone tried to do visualization of the CNN classification? Which variable are you using for the visualization?

monk1337 commented 6 years ago

I was facing the same issue, so I wrote a detailed tutorial on it. Check out this notebook on how to use a pretrained word embedding in TensorFlow:

https://github.com/monk1337/word_embedding-in-tensorflow/blob/master/Use%20Pre-trained%20word_embedding%20in%20Tensorflow.ipynb