AttentionDecoder input should be 3-D (batch, time, features), but in your case you pass (batch, height, width, features). In speech recognition I suppose your "time" dimension is the spectrogram image width, so you should collapse height and features into a single dimension and reshape the tensor appropriately.
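For example, a minimal sketch of that reshape (the array name and sizes are illustrative, not from this thread):

import numpy as np

# Suppose X has shape (batch, height, width, features) and "time" is
# the spectrogram width.
X = np.random.rand(32, 100, 300, 1).astype("float32")
batch, height, width, features = X.shape

# Move width to the time axis, then merge height and features into one
# feature dimension, giving the (batch, time, features) layout the
# AttentionDecoder expects.
X = X.transpose(0, 2, 1, 3)                     # (batch, width, height, features)
X = X.reshape(batch, width, height * features)  # (batch, width, height * features)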
Thanks for the quick answer. I flattened the image array and managed to solve that issue; it now has shape (28539, 30000). But now I've got another one: I had to flatten the output array too, since it had shape (97, 1), and then I get this error:
Model: "model_4"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_17 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
embedding_25 (Embedding) (None, None, 97) 3452133 input_17[0][0]
__________________________________________________________________________________________________
PositionEmbedding (PositionEmbe (None, None, 194) 0 embedding_25[0][0]
__________________________________________________________________________________________________
concatenate_9 (Concatenate) (None, None, 291) 0 embedding_25[0][0]
PositionEmbedding[0][0]
__________________________________________________________________________________________________
input_18 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
AttentionDecoder (AttentionDeco (None, None, 35589) 14202777 concatenate_9[0][0]
input_18[0][0]
==================================================================================================
Total params: 17,654,910
Trainable params: 14,202,777
Non-trainable params: 3,452,133
__________________________________________________________________________________________________
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-4ca425a02505> in <module>()
23 model.summary()
24
---> 25 model.fit([X_train_flattened, y_train_teacher_flattened], y_train_flattened, epochs=5, validation_data=([X_test_flattened, y_test_teacher_flattened], y_test_flattened))
2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
129 ': expected ' + names[i] + ' to have ' +
130 str(len(shape)) + ' dimensions, but got array '
--> 131 'with shape ' + str(data_shape))
132 if not check_batch_axis:
133 data_shape = data_shape[1:]
ValueError: Error when checking target: expected AttentionDecoder to have 3 dimensions, but got array with shape (28539, 97)
And if I use the unflattened features array I get:
Model: "model_5"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_19 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
embedding_28 (Embedding) (None, None, 97) 3452133 input_19[0][0]
__________________________________________________________________________________________________
PositionEmbedding (PositionEmbe (None, None, 194) 0 embedding_28[0][0]
__________________________________________________________________________________________________
concatenate_10 (Concatenate) (None, None, 291) 0 embedding_28[0][0]
PositionEmbedding[0][0]
__________________________________________________________________________________________________
input_20 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
AttentionDecoder (AttentionDeco (None, None, 35589) 14202777 concatenate_10[0][0]
input_20[0][0]
==================================================================================================
Total params: 17,654,910
Trainable params: 14,202,777
Non-trainable params: 3,452,133
__________________________________________________________________________________________________
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-535cb4dab784> in <module>()
23 model.summary()
24
---> 25 model.fit([X_train_flattened, y_train_teacher], y_train_encoded, epochs=5, validation_data=([X_test_flattened, y_test_teacher], y_test_encoded))
2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
129 ': expected ' + names[i] + ' to have ' +
130 str(len(shape)) + ' dimensions, but got array '
--> 131 'with shape ' + str(data_shape))
132 if not check_batch_axis:
133 data_shape = data_shape[1:]
ValueError: Error when checking input: expected input_20 to have 2 dimensions, but got array with shape (28539, 97, 1)
Just look at the example carefully and check the shapes. The attention decoder takes an input x of shape (batch, time, features) (and optionally y_true of shape (batch, time) for teacher forcing). The targets y for the output have shape (batch, time, 1) (assuming training is done with sparse categorical crossentropy).
You have spectrograms of shape (batch, height, width) (that's your x) and output labels (that's your y). So transpose x to (batch, width, height) (not flatten!). Your y should have shape (batch, y_len, 1) for the target and (batch, y_len) for the teacher-forcing input if you want to use it.
Additionally, it may make sense to encode your spectrogram via some convolution layers to extract useful features; a sketch of both steps follows.
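A minimal sketch of those shape fixes, assuming illustrative array names and sizes (none of these come from the thread); the two Conv1D layers at the end are only an arbitrary example of the convolutional-encoder idea:

import numpy as np
from keras.models import Model
from keras.layers import Input, Conv1D

# Spectrograms come in as (batch, height, width); labels as (batch, y_len).
X = np.random.rand(32, 100, 300).astype("float32")  # (batch, height, width)
y = np.random.randint(0, 35589, size=(32, 97))      # (batch, y_len)

# Transpose the spectrogram so width acts as the time axis:
X = X.transpose(0, 2, 1)  # (batch, width, height) == (batch, time, features)

# The teacher-forcing input keeps shape (batch, y_len); the target gets a
# trailing singleton axis for sparse categorical crossentropy:
y_teacher = y                       # (batch, y_len)
y_target = np.expand_dims(y, -1)    # (batch, y_len, 1)

# Optionally encode the spectrogram with a small convolutional stack
# before the attention decoder (layer sizes are arbitrary placeholders):
inp = Input(shape=(None, 100))  # (time, features)
feat = Conv1D(64, 5, padding='same', activation='relu')(inp)
feat = Conv1D(64, 5, padding='same', activation='relu')(feat)
encoder = Model(inputs=inp, outputs=feat)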
Thanks again for the quick reply. In your previous answer you suggested collapsing height and features into the same dimension, but that isn't possible here because the features array represents a string while the height and width represent an image, so the only option was to flatten the image array, or I'd end up with the first problem again. Combined with your last suggestion of using the flattened y array for the teacher forcing it worked with no errors, but RAM usage goes over 30 GB. I had to downsize the parameters to the following settings to make it work:
from keras.models import Model
from keras.layers import Input, Embedding, concatenate

# PositionEmbedding and AttentionDecoder come from this repo's example;
# seq_size is defined earlier in the notebook.
inputs = Input(shape=(None,), dtype='int64')
outp_true = Input(shape=(None,), dtype='int64')  # teacher-forcing input
embedded = Embedding(20, seq_size, trainable=False)(inputs)
pos_emb = PositionEmbedding(max_time=1000, n_waves=4, d_model=seq_size)(embedded)
nnet = concatenate([embedded, pos_emb])

attention_decoder = AttentionDecoder(20, seq_size,
                                     embedding_dim=3,
                                     is_monotonic=False,
                                     normalize_energy=False)
output = attention_decoder([nnet, outp_true])

model = Model(inputs=[inputs, outp_true], outputs=[output])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()

model.fit([X_train_flattened, y_train_teacher_flattened], y_train_encoded,
          epochs=5,
          validation_data=([X_test_flattened, y_test_teacher_flattened],
                           y_test_encoded))
Thanks again for all your help.
Hello.
I'm trying to use your attention implementation for speech recognition. My input is an image (a spectrogram) of shape (100, 300), represented by X_train_gs, and my output is an integer array of shape (97, 1), represented by y_train_encoded, where the trailing 1 is needed by the sparse categorical crossentropy loss function and 97 is the maximum length of the utterance. My vocabulary has 35589 words (including the start-of-string and end-of-string characters). The teacher-forcing input is represented by y_train_teacher, and the only difference from y_train_encoded is that its first character is the start-of-string character. I tried following the tutorial in example.py, but it seems I'm not doing it right. This is the code:
And this is the error I'm getting:
What am I doing wrong?