brightmart / text_classification

all kinds of text classification models and more with deep learning
MIT License

p7_TextCNN_train.py doesn't assign embedding for word_embedding_2dlist #105

Open koncle opened 5 years ago

koncle commented 5 years ago

In p7_TextCNN_train.py :

def assign_pretrained_word_embedding(sess,vocabulary_index2word,vocab_size,textCNN,word2vec_model_path):
    import word2vec # we put import here so that many people who do not use word2vec do not need to install this package. you can move import to the beginning of this file.
    print("using pre-trained word emebedding.started.word2vec_model_path:",word2vec_model_path)
    word2vec_model = word2vec.load(word2vec_model_path, kind='bin')
    word2vec_dict = {}
    for word, vector in zip(word2vec_model.vocab, word2vec_model.vectors):
        word2vec_dict[word] = vector
    word_embedding_2dlist = [[]] * vocab_size  # create an empty word_embedding list.
    word_embedding_2dlist[0] = np.zeros(FLAGS.embed_size)  # assign empty for first word:'PAD'
    bound = np.sqrt(6.0) / np.sqrt(vocab_size)  # bound for random variables.
    count_exist = 0;
    count_not_exist = 0
    for i in range(2, vocab_size):  # loop each word. notice that the first two words are pad and unknown token
        word = vocabulary_index2word[i]  # get a word
        embedding = None
        try:
            embedding = word2vec_dict[word]  # try to get vector:it is an array.
        except Exception:
            embedding = None
        if embedding is not None:  # the 'word' exist a embedding
            word_embedding_2dlist[i] = embedding;
            count_exist = count_exist + 1  # assign array to this word.
        else:  # no embedding for this word
            word_embedding_2dlist[i] = np.random.uniform(-bound, bound, FLAGS.embed_size);
            count_not_exist = count_not_exist + 1  # init a random value for the word.
    word_embedding_final = np.array(word_embedding_2dlist)  # covert to 2d array.
    word_embedding = tf.constant(word_embedding_final, dtype=tf.float32)  # convert to tensor
    t_assign_embedding = tf.assign(textCNN.Embedding,word_embedding)  # assign this value to our embedding variables of our model.
    sess.run(t_assign_embedding);
    print("word. exists embedding:", count_exist, " ;word not exist embedding:", count_not_exist)
    print("using pre-trained word emebedding.ended...")

word_embedding_2dlist[1] is never assigned an embedding, so it stays an empty list. The loop should run from 1 to vocab_size, not from 2.
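The off-by-one can be reproduced without TensorFlow or word2vec. The sketch below is a simplified stand-in for the function above, not the repository's code: build_embedding_rows, its first_word_index parameter, the "UNK" token name, and the toy vocabulary are all hypothetical, and plain Python lists replace NumPy arrays so that it runs without dependencies.

```python
import random


def build_embedding_rows(index2word, word2vec_dict, embed_size, first_word_index):
    """Return one embedding row per vocabulary index (hypothetical helper)."""
    vocab_size = len(index2word)
    bound = (6.0 / vocab_size) ** 0.5  # equals sqrt(6) / sqrt(vocab_size)
    rows = [[]] * vocab_size           # every slot starts as an empty list
    rows[0] = [0.0] * embed_size       # index 0 is the PAD token
    for i in range(first_word_index, vocab_size):
        vector = word2vec_dict.get(index2word[i])
        if vector is None:
            # No pre-trained vector for this word: fall back to random values.
            vector = [random.uniform(-bound, bound) for _ in range(embed_size)]
        rows[i] = vector
    return rows


# Toy data, assumed for illustration only.
index2word = {0: "PAD", 1: "UNK", 2: "cat", 3: "dog"}
word2vec_dict = {"cat": [0.1, 0.2, 0.3]}

buggy = build_embedding_rows(index2word, word2vec_dict, 3, first_word_index=2)
fixed = build_embedding_rows(index2word, word2vec_dict, 3, first_word_index=1)

print([len(row) for row in buggy])  # [3, 0, 3, 3]
print([len(row) for row in fixed])  # [3, 3, 3, 3]
```

Starting at 2 leaves row 1 with length 0, so the rows no longer form a rectangular matrix. Starting at 1 gives the unknown token a random vector like any other word that has no pre-trained embedding.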

Jonny-Smith-GitHub commented 4 years ago

I ran into a similar problem: ValueError: setting an array element with a sequence.

TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Anaconda3\envs\text_classification\lib\contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "D:\Anaconda3\envs\text_classification\lib\site-packages\tensorflow\python\framework\ops.py", line 5253, in get_controller
    yield g
  File "D:/company/pycharm/text_classification/a02_TextCNN/p7_TextCNN_train.py", line 84, in main
    assign_pretrained_word_embedding(sess, index2word, vocab_size, textCNN, FLAGS.word2vec_model_path)
  File "D:/company/pycharm/text_classification/a02_TextCNN/p7_TextCNN_train.py", line 236, in assign_pretrained_word_embedding
    word_embedding = tf.constant(word_embedding_final, dtype=tf.float64,shape=(11982,))  # convert to tensor
  File "D:\Anaconda3\envs\text_classification\lib\site-packages\tensorflow\python\framework\constant_op.py", line 179, in constant_v1
    allow_broadcast=False)
  File "D:\Anaconda3\envs\text_classification\lib\site-packages\tensorflow\python\framework\constant_op.py", line 283, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "D:\Anaconda3\envs\text_classification\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 440, in make_tensor_proto
    nparray = values.astype(dtype.as_numpy_dtype)
ValueError: setting an array element with a sequence.
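This error is consistent with rows of unequal length reaching the array conversion, for example a row that was never assigned. That cause is plausible but not confirmed in this thread. A quick check before the conversion can show which rows are affected. The helper below is a hypothetical diagnostic, not part of the repository. It relies only on len(), so it should work on lists and on NumPy arrays alike.

```python
def find_ragged_rows(rows, embed_size):
    """Return (index, length) for every row whose length is not embed_size."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != embed_size]


# Toy input, assumed for illustration: row 1 was never filled in.
rows = [[0.0, 0.0], [], [0.5, -0.5]]
problems = find_ragged_rows(rows, embed_size=2)
print(problems)  # [(1, 0)]
```

An empty result means every row has the expected width. Any reported index points at a row that needs a vector before the list is converted to an array.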