charlesashby / CharLSTM

Bidirectional Character LSTM for Sentiment Analysis - Tensorflow Implementation
MIT License

ValueError when training the BiLSTM: "all the input arrays must have same number of dimensions" from line 162 of data_utils.py #10

Open monajalal opened 6 years ago

monajalal commented 6 years ago

When I run:

[jalal@goku CharLSTM]$ python main.py bidirectional_lstm --train

I get the following error:

@buddhaqueen077    this is just not your day
@brwneyedbabe83  We got a last minute invite........alas not kid sitters 
Traceback (most recent call last):
  File "main.py", line 33, in <module>
    network.train()
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib_model/bidirectional_lstm.py", line 170, in train
    for minibatch in reader.iterate_minibatch(BATCH_SIZE, dataset=TRAIN_SET):
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib/data_utils.py", line 196, in iterate_minibatch
    inputs, targets = self.make_minibatch(self.data)
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib/data_utils.py", line 166, in make_minibatch
    minibatch_x = numpy_fillna(minibatch_x)
  File "/scratch2/debate_tweets/sentiment/CharLSTM/lib/data_utils.py", line 162, in numpy_fillna
    out[mask] = np.concatenate(data)
ValueError: all the input arrays must have same number of dimensions

How should this be fixed? I saw @andresiggesjo's question at https://github.com/charlesashby/CharLSTM/issues/8 and was expecting the current repo to include the fix for it. Could you please advise?

RyanOngAI commented 6 years ago

Hi monajalal,

Not sure if you are still working on this, but I believe the issue is caused by the training dataset rather than the code itself; the code should work fine. After cleaning the data as suggested in issue #8, I realised that there are still NaN values left in the text column, which I believe is what is causing the error message above. There should be 5 more NaN entries on the text side after the cleaning process suggested in #8.

RyanOngAI commented 6 years ago

The same applies to any texts that contain only unreadable symbols, which are effectively equivalent to NaN.
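
A minimal sketch of the kind of filtering described above, in case it helps. It is not code from the repo: the `(label, text)` row layout and the helper names `is_usable_text` and `clean_rows` are hypothetical, and "unreadable" is assumed here to mean "contains no ASCII letter or digit", which may be stricter than what the dataset actually needs.

```python
import math


def is_usable_text(value):
    """Return True if `value` looks like text worth keeping (hypothetical helper)."""
    # Missing cells often arrive as None or as a float NaN after loading a CSV.
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    if not isinstance(value, str):
        return False
    # Assumption: a "readable" text has at least one ASCII letter or digit.
    return any(ch.isascii() and ch.isalnum() for ch in value)


def clean_rows(rows):
    """Keep only the (label, text) pairs whose text passes is_usable_text."""
    return [(label, text) for label, text in rows if is_usable_text(text)]


# Toy rows standing in for the real training data.
rows = [
    (0, "this is just not your day"),
    (1, float("nan")),           # NaN on the text side
    (0, "\ufffd\ufffd \ufffd"),  # only replacement characters
    (1, None),                   # missing text
    (1, "   "),                  # whitespace only
]
cleaned = clean_rows(rows)
print(len(rows) - len(cleaned), "rows dropped")
```

Running a filter like this over the training file before calling main.py should leave only rows with actual text; whether that alone resolves the ValueError depends on how data_utils.py reads the file.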