AttentionDecoder input should be 3-D (batch, time, features), but in your case you pass (batch, height, width, features). In speech recognition I suppose your "time" dimension is the spectrogram image width, so you should collapse height and features into a single dimension and reshape the tensor appropriately.
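For example, a minimal sketch of that reshape (the array name and sizes are illustrative, not from this thread):

import numpy as np

# Suppose X has shape (batch, height, width, features) and "time" is
# the spectrogram width.
X = np.random.rand(32, 100, 300, 1).astype("float32")
batch, height, width, features = X.shape

# Move width to the time axis, then merge height and features into one
# feature dimension, giving the (batch, time, features) layout the
# AttentionDecoder expects.
X = X.transpose(0, 2, 1, 3)                     # (batch, width, height, features)
X = X.reshape(batch, width, height * features)  # (batch, width, height * features)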
Thanks for the quick answer. I flattened the image array and managed to solve that issue; it now has shape (28539, 30000). But now I've got another one: I had to flatten the output array too, since it had shape (97, 1), and then I get this error:
Model: "model_4"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_17 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
embedding_25 (Embedding) (None, None, 97) 3452133 input_17[0][0]
__________________________________________________________________________________________________
PositionEmbedding (PositionEmbe (None, None, 194) 0 embedding_25[0][0]
__________________________________________________________________________________________________
concatenate_9 (Concatenate) (None, None, 291) 0 embedding_25[0][0]
PositionEmbedding[0][0]
__________________________________________________________________________________________________
input_18 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
AttentionDecoder (AttentionDeco (None, None, 35589) 14202777 concatenate_9[0][0]
input_18[0][0]
==================================================================================================
Total params: 17,654,910
Trainable params: 14,202,777
Non-trainable params: 3,452,133
__________________________________________________________________________________________________
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-4ca425a02505> in <module>()
23 model.summary()
24
---> 25 model.fit([X_train_flattened, y_train_teacher_flattened], y_train_flattened, epochs=5, validation_data=([X_test_flattened, y_test_teacher_flattened], y_test_flattened))
2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
129 ': expected ' + names[i] + ' to have ' +
130 str(len(shape)) + ' dimensions, but got array '
--> 131 'with shape ' + str(data_shape))
132 if not check_batch_axis:
133 data_shape = data_shape[1:]
ValueError: Error when checking target: expected AttentionDecoder to have 3 dimensions, but got array with shape (28539, 97)
And if I use the unflattened features array I get:
Model: "model_5"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_19 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
embedding_28 (Embedding) (None, None, 97) 3452133 input_19[0][0]
__________________________________________________________________________________________________
PositionEmbedding (PositionEmbe (None, None, 194) 0 embedding_28[0][0]
__________________________________________________________________________________________________
concatenate_10 (Concatenate) (None, None, 291) 0 embedding_28[0][0]
PositionEmbedding[0][0]
__________________________________________________________________________________________________
input_20 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
AttentionDecoder (AttentionDeco (None, None, 35589) 14202777 concatenate_10[0][0]
input_20[0][0]
==================================================================================================
Total params: 17,654,910
Trainable params: 14,202,777
Non-trainable params: 3,452,133
__________________________________________________________________________________________________
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-535cb4dab784> in <module>()
23 model.summary()
24
---> 25 model.fit([X_train_flattened, y_train_teacher], y_train_encoded, epochs=5, validation_data=([X_test_flattened, y_test_teacher], y_test_encoded))
2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
129 ': expected ' + names[i] + ' to have ' +
130 str(len(shape)) + ' dimensions, but got array '
--> 131 'with shape ' + str(data_shape))
132 if not check_batch_axis:
133 data_shape = data_shape[1:]
ValueError: Error when checking input: expected input_20 to have 2 dimensions, but got array with shape (28539, 97, 1)
Just look at the example carefully and check the shapes. The attention decoder takes an input x of shape (batch, time, features) (and optionally y_true of shape (batch, time) for teacher forcing). The targets y for the output have shape (batch, time, 1) (assuming training is done with sparse categorical crossentropy).
You have spectrograms of shape (batch, height, width) (that's your x) and output labels (that's your y). So transpose x to (batch, width, height) (not flatten!). Your y should have shape (batch, y_len, 1) for the target and (batch, y_len) for the teacher-forcing input if you want to use it.
Additionally, it may make sense to encode your spectrogram via some convolution layers to extract useful features; a sketch of both steps follows.
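A minimal sketch of those shape fixes, assuming illustrative array names and sizes (none of these come from the thread); the two Conv1D layers at the end are only an arbitrary example of the convolutional-encoder idea:

import numpy as np
from keras.models import Model
from keras.layers import Input, Conv1D

# Spectrograms come in as (batch, height, width); labels as (batch, y_len).
X = np.random.rand(32, 100, 300).astype("float32")  # (batch, height, width)
y = np.random.randint(0, 35589, size=(32, 97))      # (batch, y_len)

# Transpose the spectrogram so width acts as the time axis:
X = X.transpose(0, 2, 1)  # (batch, width, height) == (batch, time, features)

# The teacher-forcing input keeps shape (batch, y_len); the target gets a
# trailing singleton axis for sparse categorical crossentropy:
y_teacher = y                       # (batch, y_len)
y_target = np.expand_dims(y, -1)    # (batch, y_len, 1)

# Optionally encode the spectrogram with a small convolutional stack
# before the attention decoder (layer sizes are arbitrary placeholders):
inp = Input(shape=(None, 100))  # (time, features)
feat = Conv1D(64, 5, padding='same', activation='relu')(inp)
feat = Conv1D(64, 5, padding='same', activation='relu')(feat)
encoder = Model(inputs=inp, outputs=feat)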
Thanks again for the quick reply. In your previous answer you suggested collapsing height and features into the same dimension, but that isn't possible here because the features array represents a string while the height and width represent an image, so the only option was to flatten the image array, or I'd end up with the first problem again. Combined with your last suggestion of using the flattened y array for the teacher forcing it worked with no errors, but RAM usage goes over 30 GB. I had to downsize the parameters to the following settings to make it work:
from keras.models import Model
from keras.layers import Input, Embedding, concatenate

# PositionEmbedding and AttentionDecoder come from this repo's example;
# seq_size is defined earlier in the notebook.
inputs = Input(shape=(None,), dtype='int64')
outp_true = Input(shape=(None,), dtype='int64')  # teacher-forcing input
embedded = Embedding(20, seq_size, trainable=False)(inputs)
pos_emb = PositionEmbedding(max_time=1000, n_waves=4, d_model=seq_size)(embedded)
nnet = concatenate([embedded, pos_emb])

attention_decoder = AttentionDecoder(20, seq_size,
                                     embedding_dim=3,
                                     is_monotonic=False,
                                     normalize_energy=False)
output = attention_decoder([nnet, outp_true])

model = Model(inputs=[inputs, outp_true], outputs=[output])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()

model.fit([X_train_flattened, y_train_teacher_flattened], y_train_encoded,
          epochs=5,
          validation_data=([X_test_flattened, y_test_teacher_flattened],
                           y_test_encoded))
Thanks again for all your help.
Hello.
I'm trying to use your attention implementation for speech recognition. My input is an image (a spectrogram) of shape (100, 300), represented by X_train_gs, and my output is an integer array of shape (97, 1), represented by y_train_encoded, where the trailing 1 is needed by the sparse categorical crossentropy loss function and 97 is the maximum length of the utterance. My vocabulary has 35589 words (including the start-of-string and end-of-string characters). The teacher-forcing input is represented by y_train_teacher, and the only difference from y_train_encoded is that its first character is the start-of-string character. I tried following the tutorial in example.py, but it seems I'm not doing it right. This is the code:
And this is the error I'm getting:
What am I doing wrong?