charlesashby / CharLSTM

Bidirectional Character LSTM for Sentiment Analysis - Tensorflow Implementation
MIT License

ValueError: all the input arrays must have same number of dimensions #8

Open andresiggesjo opened 6 years ago

andresiggesjo commented 6 years ago

Hello, so I get the error on both my laptop and my desktop, although at different times. On my laptop it occurs after the first epoch; by using a smaller sample of the dataset and adjusting the variables, I managed to reach the validation epoch multiple times before crashing. On my desktop, however, it crashes a couple of minutes into the first epoch.

"Fever broke in the middle of the night, but its rising again " "everything is starting to fall into place, busy week coming up ugh and cousins grad! " Traceback (most recent call last): File "main.py", line 23, in network.train() File "/home/ando/azam/CharLSTM/lib_model/char_lstm.py", line 172, in train for minibatch in reader.iterate_minibatch(BATCH_SIZE, dataset=TRAIN_SET): File "/home/ando/azam/CharLSTM/lib/data_utils.py", line 197, in iterate_minibatch inputs, targets = self.make_minibatch(self.data) File "/home/ando/azam/CharLSTM/lib/data_utils.py", line 167, in make_minibatch minibatch_x = numpy_fillna(minibatch_x) File "/home/ando/azam/CharLSTM/lib/data_utils.py", line 163, in numpy_fillna out[mask] = np.concatenate(data) ValueError: all the input arrays must have same number of dimensions

charlesashby commented 6 years ago

ok thanks @andresiggesjo I'll try to understand what's going on!

charlesashby commented 6 years ago

I think there was a problem with the dataset. I'll try to do a proper commit this week, but for the moment this code should clean it; some numbers could not be read for some reason. Let me know if you have any issues!

data = []

# Keep only lines whose label field parses as 0 (negative) or 4 (positive);
# lines with an unreadable label raise ValueError and are skipped.
with open(TRAIN_SET, 'r') as f:
    lines = f.readlines()

for i, l in enumerate(lines):
    try:
        label = int(l.split(',')[0])
        if label != 4 and label != 0:
            print(l)  # weirdly formatted line, dropped
        else:
            data.append(l)
    except ValueError:
        pass

# Save "data" back to some file, e.g.:
with open(TRAIN_SET + '.cleaned', 'w') as f:
    f.writelines(data)
andresiggesjo commented 6 years ago

Hi Charles, thanks for the quick solution. Neither the if branch nor the else branch fires for me: the data array stays empty and print(l) never happens. I put the code in the shuffle_datasets function to try to clean the data before creating the train_set, valid_set, and test_set files. Did I misinterpret something?

charlesashby commented 6 years ago

The "fix" is to delete the sentences that are encoded in a weird format and that cannot be processed for some reasons.. Run the code on the TRAIN_SET, VALID_SET and TEST_SET after shuffling them

It should remove ~1000 sentences

andresiggesjo commented 6 years ago

@charlesashby Alright, I got much further than I've gotten before, but it still crashed the same way. The datasets are smaller now, so it did remove the lines without a 0 or 4 sentiment label.

Epoch: 1/ 500 -- batch: 1600/23750 -- Loss: 44.3850 -- Train Accuracy: 0.6094

Traceback (most recent call last):
  File "main.py", line 23, in <module>
    network.train()
  File "/home/ando/ando/CharLSTM/lib_model/char_lstm.py", line 168, in train
    for minibatch in reader.iterate_minibatch(BATCH_SIZE, dataset=TRAIN_SET):
  File "/home/ando/ando/CharLSTM/lib/data_utils.py", line 239, in iterate_minibatch
  File "/home/ando/ando/CharLSTM/lib/data_utils.py", line 209, in make_minibatch
    if self.load_to_ram(batch_size):
  File "/home/ando/ando/CharLSTM/lib/data_utils.py", line 205, in numpy_fillna
ValueError: all the input arrays must have same number of dimensions

andresiggesjo commented 6 years ago

I changed the dataset to another 1.5M tweet dataset from sentiment140 and it worked for around 10 hours before crashing the same way. It got to batch 17900 on epoch 1 before crashing.

charlesashby commented 6 years ago

Yeah, it's definitely a problem with the encoding. Try going through every sentence in the dataset before training and removing the weirdly formatted ones; it should work fine afterward!

andresiggesjo commented 6 years ago

How can I spot the wrongly encoded sentences? I'm already removing all the ones not starting with 0 or 4.

charlesashby commented 6 years ago

Try to print them haha, or do any kind of operation on them; if that works, you should not have any problems during training.

andresiggesjo commented 6 years ago

Alright, I'll give it a try. By the way, I'm thinking of trying this on a Swedish dataset I have, but I would need to be able to classify neutral tweets as well. I saw a comment on your blog where you said it's as easy as changing the output from 2 to 3, but I had some trouble finding exactly where to change the output hehe. Thanks for all the help!

charlesashby commented 6 years ago

Yep, I think you would only need to change:

# lib_model/char_lstm.py
self.Y = tf.placeholder('float32', shape=[None, 3], name='Y')
self.prediction = softmax(last, 3)

You might have to make some modifications to the minibatch functions as well.
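For the minibatch side, a minimal sketch of what a 3-class target could look like (this is not the repo's actual code; it assumes sentiment140-style labels, where 0 = negative, 2 = neutral, and 4 = positive):

```python
import numpy as np

# Hypothetical label mapping: sentiment140 polarity -> class index
LABELS = {0: 0, 2: 1, 4: 2}

def one_hot(label, n_classes=3):
    """Turn a raw polarity label into a one-hot target row for self.Y."""
    y = np.zeros(n_classes, dtype=np.float32)
    y[LABELS[label]] = 1.0
    return y

print(one_hot(2))  # [0. 1. 0.]
```

Each minibatch target row then matches the widened shape=[None, 3] placeholder.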

You're welcome!

andresiggesjo commented 6 years ago

Hello Charles, I reworked the code somewhat to run on Python 3, to make it easier to use a Swedish dataset, and I have a question regarding this code from encode_one_hot in data_utils.py:

encoded_sentence = filter(lambda x: x in (printable), sentence)
for word in word_tokenize(encoded_sentence.decode('utf-8', 'ignore').encode('utf-8')):

I guess I'm having some trouble understanding what the filter(lambda ...) does? When printing both of them and comparing, they seem to output more or less the same thing.

Thanks!

charlesashby commented 6 years ago

Hey! Take a look at http://book.pythontips.com/en/latest/map_filter.html. Basically, filter keeps an element only when the function returns True; in our case it deletes non-printable characters.
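A small standalone illustration of that filtering step (the sample sentence is made up; note that in Python 2 filter on a str returns a str directly, while in Python 3 it returns an iterator, hence the join):

```python
import string

printable = set(string.printable)

# Characters outside printable ASCII (here the accented e and the snowman)
# are dropped by the filter.
sentence = "caf\u00e9 tweet \u2603"
cleaned = ''.join(filter(lambda ch: ch in printable, sentence))
print(cleaned)  # "caf tweet "
```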

andresiggesjo commented 6 years ago

Oh alright, thanks mate! I do have one last question though haha. I get vastly different results from evaluate_test_set() and the validation step inside train, and I've made sure I'm running valid_set on both of them. Is this intended? I get 95-100% validation accuracy with the validation part inside train, but when running evaluate_test_set on that saved model I only get around 70-75% accuracy.

charlesashby commented 6 years ago

@andresiggesjo Weird.. I'll try to find out what's going on when I have some free time!

camer314 commented 6 years ago

I had all sorts of trouble with this error as well and spent an eternity trying various cleanings of the data, each time getting a little further, but it never fully worked. The only way I could get it to run to completion was to print every row in the load_to_ram function and, if printing raised an exception, ignore that row.

Eventually I went back to the drawing board, took the original Stanford input dataset, and cleaned it using a C# program instead of Python; this resulted in 16 rows being removed. Using that cleaned CSV file as my training set, it now works.

By the way, I also converted to Python 3 in the process.
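The same row-level cleanup can be sketched in Python too. This is a hypothetical equivalent of that C# pass, assuming the sentiment140/Stanford CSV's six-column layout (polarity, id, date, query, user, text); the function name and file handling are my own:

```python
import csv

def clean_csv(in_path, out_path, n_cols=6):
    """Copy only rows that parse as n_cols columns with a 0/4 polarity label."""
    kept = dropped = 0
    with open(in_path, newline='') as fin, open(out_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            if len(row) == n_cols and row[0] in ('0', '4'):
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1  # malformed or non-binary-label row
    return kept, dropped
```

Running it over the training CSV before shuffling should drop the handful of rows that break make_minibatch.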

andresiggesjo commented 6 years ago

After doing some testing, I found that it crashes on tweets that are one character long, for example a single number or letter. These lines print fine, but the algorithm has some problem with them anyway.

camer314 commented 6 years ago

That's true, but it also crashes if the sentence becomes empty after the filter.
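A minimal sketch of why an empty sentence can trigger the ValueError in numpy_fillna (the shapes here are illustrative assumptions: a normal sentence encoding as a 2-D per-character array, an empty one collapsing to a 1-D empty array):

```python
import numpy as np

rows = [
    np.array([[1, 0], [0, 1]]),  # normal sentence -> shape (2, 2)
    np.array([]),                # empty after filtering -> shape (0,)
]
try:
    np.concatenate(rows)  # mixes 2-D and 1-D inputs
except ValueError as e:
    print(e)  # "all the input arrays must have same number of dimensions ..."
```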

As a side note, the DICT contains a mapping of the alphabet, but only lowercase; for uppercase letters it gets a lookup error and ends up passing in the exception handler. I assume this is a bug, so I convert A-Z to lowercase first.
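A hypothetical sketch of that workaround (DICT here is a stand-in built from a plain lowercase alphabet, not the repo's actual mapping): fold the input to lowercase before the lookup so uppercase characters no longer fall through to the exception handler.

```python
# Stand-in lowercase-only alphabet mapping
DICT = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def encode(sentence):
    """Map characters to indices, lowercasing first to avoid lookup misses."""
    ids = []
    for ch in sentence.lower():
        if ch in DICT:  # characters outside the alphabet are skipped
            ids.append(DICT[ch])
    return ids

print(encode("Hello"))  # [7, 4, 11, 11, 14]
```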

andresiggesjo commented 6 years ago

Ah alright, I will do that as well. Thanks!

andresiggesjo commented 6 years ago

@charlesashby I really want to thank you for your tutorial and this code; it has helped me loads in my project! I got it working properly with a large Swedish dataset and achieved accuracy similar to yours, 80-84% on the test set.

I do have one final question though: how many hidden layers are in the LSTM, and how many nodes are in each layer? I assume the final layer has 650 nodes, which is the rnn size, but what about the hidden ones?

Thank you again for the tutorial and the code!

camer314 commented 6 years ago

Did you look at the graph in Tensorboard?

I have saved it to HTML; view it in Chrome and you should be able to navigate the model and see all the nodes from within the browser, if this is what you were after...

https://wtwdeeplearning.blob.core.windows.net/temp/graph_def.html?st=2018-04-04T05%3A05%3A00Z&se=2018-05-05T05%3A05%3A00Z&sp=rl&sv=2015-12-11&sr=b&sig=YjeekVdXVYp1JNAZ6xUqcUQLM%2BVio5Y%2FzxFjzsoiQnk%3D

andresiggesjo commented 6 years ago

Yeah, I did look at it in TensorBoard, but I was thinking more of something like this -> http://konstilackner.github.io/LSTM-RNN-Melody-Composer-Website/images/LSTMRNNNetworkTopology.png

More specifically, I'm interested in how many hidden layers are in the LSTM.

charlesashby commented 6 years ago

@andresiggesjo Glad I could help you! About your question: typically, LSTMs have only one layer that gets unrolled (in our case, it gets unrolled for sentence_length steps). In the bidirectional LSTM, however, our data goes through the first LSTM layer and is then fed to a second LSTM layer (if I remember correctly, both of these layers have 650 neurons/nodes).
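The unrolling and the forward/backward combination can be sketched in plain NumPy. This is a simplification, not the repo's implementation: a tanh RNN cell stands in for the LSTM cell, and shapes are illustrative except for the 650-unit hidden size mentioned above.

```python
import numpy as np

def recurrent_pass(xs, W, U, b):
    """Unroll one recurrent layer over a sequence, one step per character."""
    h = np.zeros(U.shape[0])
    outputs = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)  # simplified cell (tanh RNN, not LSTM)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
seq_len, in_dim, hidden = 5, 4, 650  # hidden=650 matches the rnn size above
xs = rng.normal(size=(seq_len, in_dim))
W = rng.normal(size=(hidden, in_dim)) * 0.1
U = rng.normal(size=(hidden, hidden)) * 0.01
b = np.zeros(hidden)

forward = recurrent_pass(xs, W, U, b)               # read left to right
backward = recurrent_pass(xs[::-1], W, U, b)[::-1]  # read right to left
combined = np.concatenate([forward, backward], axis=-1)
print(combined.shape)  # (5, 1300)
```

The classifier then sees one 2 * 650-dimensional feature per position, which is where the 650 "nodes" figure comes from.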

andresiggesjo commented 6 years ago

@charlesashby Alright, thanks! By the way, sorry for carrying on the conversation in the issues page. I've been playing around with the learning rate, different decay techniques, and different datasets; the accuracy on train/val/test all seems to plateau at around 81-83%. Do you think this is because of the architecture, or maybe because the datasets need a neutral class as well?

charlesashby commented 6 years ago

@andresiggesjo You're welcome! I think you could get better results if you trained it on more data; you could try gathering more tweets using Twitter's API and filtering the ones with :) / :( emoticons as positive and negative. However, the main problem (I think) with the model is sarcasm, so maybe adding more examples of sarcasm could improve your results? Let me know if you decide to try it out!

monajalal commented 6 years ago

Following the recommendations here, I get the following error:

[jalal@goku CharLSTM]$ python main.py bidirectional_lstm --train
Using model: bidirectional_lstm
Training: True
2018-04-19 21:53:11.470158: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-19 21:53:11.470202: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-19 21:53:11.470259: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-04-19 21:53:11.470266: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-19 21:53:11.470273: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
WARNING:tensorflow:From /scratch2/debate_tweets/sentiment/CharLSTM/lib_model/bidirectional_lstm.py:158: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
",", ,w,h,y, ,d,o,e,s, ,i,t, ,l,o,o,k, ,l,i,k,e, ,i,t,',s, ,g,o,n,n,a, ,r,a,i,n, ,t,o,d,a,y,?, ,I, ,n,e,e,d, ,t,o, ,g,o, ,c,h,u,r,c,h,!, ,J,E,S,U,S, ,L,O,V,E,S, ,Y,O,U,!,!,!,!,!,"

",",i,s, ,w,a,i,t,i,n,g, ,f,o,r, ,K,a,t,i,e, ,t,o, ,g,e,t, ,h,e,r,e, ,t,o, ,g,o, ,s,e,e, ,J,a,d,e, ,a,n,d, ,g,o, ,s,h,o,p,p,i,n,g, ,"

",",t,h,a,t, ,',t,h,e, ,h,o,r,n,y, ,k,i,t,t,y,',r,e,a,l,l,y, ,w,a,n,t,s, ,t,o, ,f,o,l,l,o,w, ,m,e, ,d,o,n,t, ,s,h,e,!, , ,l,o,l, ,c,r,a,z,y, ,a,s,s, ,b,o,t, ,f,o,l,l,o,w,e,r,s,!,"

",","""",@,N,i,a,B,a,s,s,e,t,t, ,B,a,r,e,l,y, ,t,h,e,r,e,",", ,h,u,h,?,!, , ,h,a,"""","

",",O,h, ,n,o,.,.,.,M,a,r,t,i,n, ,K,a,y,m,e,r, ,j,u,s,t, ,b,o,g,i,e,d, ,t,h,e, ,1,1,t,h,!,!, ,H,o,p,e, ,i, ,h,a,v,e,n,',t, ,j,i,n,x,e,d, ,h,i,m,!,!, ,O,o,o,p,s, ,"

",",#,3,s,t,a,l,k,e,r,w,o,r,d,s, ,I,',L,L, ,B,E, ,W,A,I,T,I,N,G, ,"

",",@,v,e,r,i,t,o,_,s,t,a,r, , ,o,n,l,y, ,@,n,i,c,k,_,c,a,r,t,e,r, ,;,), ,a,l,w,a,y,s, ,b,l,a,m,e, ,n,i,c,k, ,"

",","""",@,r,e,p,l,y, ,T,h,a,n,k,s, ,f,o,r, ,a,l,l, ,t,h,e, ,a,w,e,s,o,m,e, ,s,e,m,i,n,a,r,s,",", ,l,e,c,t,u,r,e,s,",", ,k,e,y,n,o,t,e,s, ,a,n,d, ,o,t,h,e,r, ,j,u,n,k, ,#,o,p,e,n,v,i,d,e,o,., ,S,o,",",y, ,I, ,h,a,d, ,t,o, ,b,a,i,l, ,b,e,f,o,r,e, ,t,h,e, ,t,h,e, ,l,a,s,t, ,t,h,i,n,g,!, ,"""","

",",@,A,r,t,M,i,n,d, ,i, ,d,i,d,n,',t, ,s,e,e, ,y,o,u,r, ,f,e,a,t,u,r,e, ,D,:, ,b,u,t, ,i, ,s,u,r,e, ,d,o, ,l,o,v,e, ,t,h,a,t, ,s,i,t,e,!, ,t,h,a,n,k,s, ,f,o,r, ,s,h,a,r,i,n,g,!,!, ,"

",",@,K,r,y,s,t,a,l,L,a,R,a,e, ,t,h,a,t, ,m,u,s,t, ,b,e, ,b,a,d, ,f,o,r, ,y,o,u,., ,"

",",M,e,h,., ,T,o,t,a,l,l,y, ,c,b,a, ,w,i,t,h,.,w,o,r,k, ,f,o,r, ,4, ,h,o,u,r,s, ,"

",",j,u,s,t, ,s,h,o,o,s,h, ,"

",","""",T,o, ,g,e,t, ,y,o,u,r,s,",", ,g,o, ,t,o, ,w,w,w,.,A,n,n,u,a,l,C,r,e,d,i,t,R,e,p,o,r,t,.,c,o,m, , ,"""","

",",@,d,j,m,o,b,e,a,t,z, ,N,o,t, ,d,r,e,s,s,e,d, ,f,o,r, ,i,t, , ,i,l,l, ,b,e, ,o,u,t, ,n,e,x,t, ,w,e,e,k, ,t,h,o,"

",","""",@,e,m,o,_,z,a,b,o,o, ,u,g,h,.,.,.,s,o,r,r,y, ,a,b,o,u,t, ,y,o,u,r, ,f,i,n,g,e,r, , ,y,e,a,",", ,y,o,u, ,s,h,o,u,l,d, ,r,e,a,l,l,y, ,f,o,l,l,o,w, ,m,a,t,t,_,t,u,c,k, ,o,n, ,t,w,i,t,t,e,r,.,",",.,i,',m, ,l,m,a,o, ,r,i,g,h,t, ,n,o,w,.,.,.,h,e,',s, ,s,o,o,o, ,f,u,n,n,y,!,"""","

",",@,m,a,c,t,a,v,i,s,h, ,l,a,w,n,s, ,a,r,e, ,a, ,h,o,r,r,i,b,l,e, ,u,s,e, ,o,f, ,r,e,s,o,u,r,c,e,s, , ,d,o, ,t,h,e,y, ,e,v,e,n, ,h,e,l,p, ,w,i,t,h, ,t,h,e, ,a,i,r, ,m,u,c,h,?,"

",",S,o,m,e,o,n,e, ,h,e,l,p, ,m,e, ,f,i,n,d, ,a, ,D,&,a,m,p,;,D, ,t,a,b,l,e, ,p,l,e,a,s,e, ,"

",","""",i,s, ,g,o,i,n,g, ,t,o, ,t,h,e, ,s,h,o,p,s, ,n,o,w,",", ,w,i,t,h, ,b,i,g, ,s,u,n, ,g,l,a,s,s,e,s, ,t,o, ,h,i,d,e, ,m,y, ,e,y,e,!, , ,*,h,u,m,m,p,h,!,*,"""","

",",@,v,u,l,c,a,n,s,t,e,v, ,.,.,.,A,n,d,.,.,.,c,o,m,m,e,n,t,e,d,., ,"

",",@,I,a,m,M,a,x,a,t,H,o,t,S,p,o,t, ,S,h,e,',l,l, ,b,e, ,a, ,b,i,l,l,i,o,n,a,i,r,e, ,s,o,m,e, ,d,a,y,!, ,"

",",@,R,u,m,f,o,r,d, ,i,t,',s, ,b,e,e,n, ,a, ,b,e,a,u,t,i,f,u,l, ,d,a,y, ,i,n, ,c,e,n,t,r,a,l, ,#,I,n,d,i,a,n,a, , ,n,e,w, ,p,o,s,t, ,a,t, ,h,t,t,p,:,/,/,I,n,d,y,S,o,c,i,a,l,M,e,d,i,a,.,c,o,m,"

",",S,i,t,t,i,n,g, ,i,n, ,t,h,e, ,r,e,d, ,r,o,o,m, ,t,h,i,n,k,i,n,g,.,.,.,i, ,l,e,f,t, ,m,y, ,i,p,o,d, ,a,t, ,h,o,m,e, ,"

",",@,S,t,u,s,h,m,u,s,i,c, ,w,h,a,t,',s, ,u,p, ,w,i,t,h, ,t,h,e, ,s,e,s,s,i,o,n, ,t,o,d,a,y,?,@,m,u,s,i,c,m,y,s,t,r,o, ,w,h,e,n, ,i,s, ,t,h,e, ,n,e,x,t, ,e,v,e,n,t,?,@,r,e,m,e,m,b,e,r,m,e,n,i,n,a,b","I, ,s,p,o,k,e, ,t,o, ,a,m,b,e,r, ,l,a,s,t, ,n,i,g,h,t, ,"

",","""",@,C,h,u,b,b,x, ,G,o,o,o,o,o,o,o,d, ,m,o,r,n,i,n,g,",", ,C,h,u,b,b,x, , ,h,o,w, ,a,r,e, ,y,o,u, ,t,o,d,a,y,?,"""","

",",A,p,o,l,o,g,y,!, ,W,o,w,.,.,., ,r,i,g,h,t,l,y, ,s,o,., ,W,e,a,t,h,e,r,s, ,g,o,o,d, ,n,o,w, ,"

",","""",@,t,h,e,I,I,I, ,o,f, ,c,o,u,r,s,e,",", ,a,n,d, ,i, ,a,m, ,t,h,e, ,m,e,t,s, ,g,o,o,d, ,l,u,c,k, ,c,h,a,r,m,",", ,w,h,e,n, ,i, ,g,o, ,M,e,t,s, ,w,i,n, ,"""","

",","""",L,a,s,t, ,n,i,g,h,t, ,s,t,u,d,i,o, ,t,i,m,e, ,r,a,n, ,l,a,t,e, , ,B,U,T, ,w,e, ,h,a,v,e, ,t,h,e, ,p,r,e,-,p,r,o,d,u,c,t,i,o,n,",", ,n,o,w, ,m,a,s,t,e,r,i,n,g, ,a,n,d, ,m,i,x,i,n,g,!,"""","

",",@,a,m,i,t,p,r,a,s,a,d, ,t,w,i,t,t,e,r, ,p,e, ,t,o,h, ,n,a,h,i,n, ,m,i,l,e,g,i, ,t,u,m,h,e, ,.,., ,t,r,y, ,s,u,l,e,k,h,a,.,c,o,m, ,"

",","""",@,o,h,u,g,i,r,l,0,9, , ,O,h,",", ,I, ,l,i,k,e, ,t,h,a,t, , ,r,a,t,i,o,n,a,l,e,., ,I, ,t,h,i,n,k, ,I, ,s,h,a,l,l, ,u,s,e, ,i,t,!, ,"""","

",","""",@,M,c,C,a,i,n,B,l,o,g,e,t,t,e, ,M,e,g,h,a,n, ,p,l,e,a,s,e,",", ,p,l,e,a,s,e,",", ,c,o,m,e, ,o,u,t, ,a,g,a,i,n,s,t, ,t,h,e, ,e,x,t,r,e,m,e, ,r,i,g,h,t, ,w,i,n,g, ,y,o,u, ,h,a,v,e, ,a, ,v,o,",",e,., ,G,u,y,s, ,l,i,k,e, ,O,r,i,e,l,l,y, ,g,o,t, ,t,h,i,s, ,m,a,n, ,k,i,l,l,e,d, ,"""","

",",@,K,y,n,g,A,l,i,e,n, ,I, ,h,a,v,e,!,!, ,G,o,o,d, ,t,i,m,e,s, ,"

",","""",@,s,e,a,n,m,c,g,i,n,l,e,y, ,s,h,e, ,i,s, ,s,o,o, ,m,a, ,d,e,a,r, , ,",", ,h,a,h,a,h,a, ,i,m, , ,j,u,s,t, ,t,h,e, ,b,e,s,t, ,!, ,+,"""","

Traceback (most recent call last):
  File "main.py", line 33, in <module>
    network.train()
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib_model/bidirectional_lstm.py", line 170, in train
    for minibatch in reader.iterate_minibatch(BATCH_SIZE, dataset=TRAIN_SET):
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib/data_utils.py", line 247, in iterate_minibatch
    inputs, targets = self.make_minibatch(self.data)
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib/data_utils.py", line 217, in make_minibatch
    minibatch_x = numpy_fillna(minibatch_x)
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib/data_utils.py", line 213, in numpy_fillna
    out[mask] = np.concatenate(data)
ValueError: all the input arrays must have same number of dimensions
[jalal@goku CharLSTM]$ 
monajalal commented 6 years ago

Can you please share the corrected CharLSTM/lib/data_utils.py file? Thanks @andresiggesjo @charlesashby

andresiggesjo commented 6 years ago

@monajalal Hi Mona, I've changed quite a bit of the code to suit my needs, but the trouble lies in the dataset, so I would suggest either creating your own dataset (like I did) or finding another dataset online.