asmekal / keras-monotonic-attention

seq2seq attention in keras
GNU Affero General Public License v3.0
40 stars 7 forks

What shall be the recommended config? #2

Closed sekarpdkt closed 6 years ago

sekarpdkt commented 6 years ago

What would be the recommended config for the following scenario?

When t=128 and n_labels is 100, I tried:

# generating data
n, t = 100000, 128  # 100,000 sequences, 128 chars each
n_labels = 100  # number of frequently used chars
x = np.random.randint(0, n_labels, size=(n, t))
y = np.expand_dims(x, axis=-1)
x_val = np.random.randint(0, n_labels, size=(n // 100, t))
y_val = np.expand_dims(x_val, axis=-1)

# building model
inputs = Input(shape=(None,), dtype='int64')
outp_true = Input(shape=(None,), dtype='int64')
embedded = Embedding(n_labels, n_labels, weights=[np.eye(n_labels)], trainable=False)(inputs)

pos_emb = PositionEmbedding(max_time=1000, n_waves=20, d_model=128)(embedded)
nnet = concatenate([embedded, pos_emb], axis=-1)

attention_decoder = AttentionDecoder(256, n_labels,
                                     embedding_dim=64,
                                     is_monotonic=False,
                                     normalize_energy=False)
# use teacher forcing
output = attention_decoder([nnet, outp_true])
## (alternative) without teacher forcing
#output = attention_decoder(nnet)
model = Model(inputs=[inputs, outp_true], outputs=[output])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adadelta',
    metrics=['accuracy'])
model.summary()

model.fit([x, np.squeeze(y, axis=-1)], y,
          epochs=2,
          validation_data=([x_val, np.squeeze(y_val, axis=-1)], y_val))

Accuracy is stuck at 1%, with and without teacher forcing.

asmekal commented 6 years ago

First of all, I would recommend starting training with shorter sequences and gradually increasing the time dimension of the training data (since shape=(None,) in the inputs, the model can work with data of any time dimension). This part is probably the most important.

Next, for this particular toy example, teacher forcing and the output embedding size are unimportant and may even slow down convergence, because the labels are generated randomly, so the next output has no connection with the previous one.

For PositionEmbedding the equation for the 2i-th wave is encodings[2 * i] = np.sin(positions / 10. ** (2. * i / d_model)). My intuition is that the last wave should have a period max_time/5 <= T <= max_time (the max time is 128 in your example). So there should be a few entire periods over the sequence: at least one, but not too many.
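
As a rough sketch of that intuition (using the wave equation above; last_wave_period is just an illustrative helper, not part of the repo), you can compute the period of the slowest wave for a given configuration and compare it against your sequence length:

import numpy as np

def last_wave_period(n_waves, d_model):
    # sin(positions / 10. ** (2. * i / d_model)) has period 2 * pi * 10. ** (2. * i / d_model);
    # the slowest ("last") wave corresponds to i = n_waves - 1
    return 2 * np.pi * 10. ** (2. * (n_waves - 1) / d_model)

# compare these against the sequence length you train on
print(last_wave_period(n_waves=20, d_model=128))
print(last_wave_period(n_waves=50, d_model=128))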

I also decreased the number of units in AttentionDecoder for faster convergence.

With all of that said, I used the following configuration and achieved 78% accuracy on the long sequences before the gradients exploded. I think that after adding gradient clipping or changing the learning rate the problem should be solved perfectly (a clipping sketch follows the code below).

# generating data
n, t = 100000, 20  # start with shorter sequences
n_labels = 100  # number of frequently used chars
x = np.random.randint(0, n_labels, size=(n, t))
y = np.expand_dims(x, axis=-1)
x_val = np.random.randint(0, n_labels, size=(n // 100, t))
y_val = np.expand_dims(x_val, axis=-1)

# building model
inputs = Input(shape=(None,), dtype='int64')
outp_true = Input(shape=(None,), dtype='int64')
embedded = Embedding(n_labels, n_labels, weights=[np.eye(n_labels)], trainable=False)(inputs)

pos_emb = PositionEmbedding(max_time=1000, n_waves=50, d_model=128)(embedded)
nnet = concatenate([embedded, pos_emb], axis=-1)

attention_decoder = AttentionDecoder(50, n_labels,
                                     embedding_dim=16,
                                     is_monotonic=False,
                                     normalize_energy=False)

## (alternative) without teacher forcing
output = attention_decoder(nnet)
model = Model(inputs=[inputs, outp_true], outputs=[output])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adadelta',
    metrics=['accuracy'])
model.summary()

model.fit([x, np.squeeze(y, axis=-1)], y,
          epochs=2,
          validation_data=([x_val, np.squeeze(y_val, axis=-1)], y_val))

# setting desired sequence length
n, t = 100000, 128  # 128 chars in sequence
x = np.random.randint(0, n_labels, size=(n, t))
y = np.expand_dims(x, axis=-1)
x_val = np.random.randint(0, n_labels, size=(n // 100, t))
y_val = np.expand_dims(x_val, axis=-1)

model.fit([x, np.squeeze(y, axis=-1)], y,
          epochs=5,
          validation_data=([x_val, np.squeeze(y_val, axis=-1)], y_val))
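
As a minimal sketch of the gradient clipping idea mentioned above (clipnorm=1.0 is an arbitrary starting value, not a tuned setting), the compile call could be changed to something like:

from keras import optimizers

model.compile(
    loss='sparse_categorical_crossentropy',
    # Adadelta as before, but with gradient norm clipping to limit exploding gradients
    optimizer=optimizers.Adadelta(clipnorm=1.0),
    metrics=['accuracy'])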

sekarpdkt commented 6 years ago

Thanks, I will try. I agree: initially it was stuck at 1%, but later it reached up to 50% after 10 epochs. I will try 25 epochs sometime and update.

Noting some points for the benefit of others:

  1. I am able to save and load the weights (not the model itself, which is fine anyway); see the sketch after this list.
  2. It requires Keras 2.x; it will not work on Keras 1.x.
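
A minimal sketch of the save/load-weights workflow from point 1 (the file name is arbitrary; the model must be rebuilt with the same code before loading, since only the weights are stored):

model.save_weights('attention_model_weights.h5')

# ... later, after rebuilding the exact same architecture:
model.load_weights('attention_model_weights.h5')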

sekarpdkt commented 6 years ago

OK. If we decide not to use teacher forcing, can we change the model as follows?


## (alternative) without teacher forcing
output = attention_decoder(nnet)
model = Model(inputs=inputs, outputs=output)
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adadelta',
    metrics=['accuracy'])
model.summary()

model.fit(x, y,
          epochs=5,
          validation_data=(x_val, y_val))

asmekal commented 6 years ago

Certainly we can (and we should). I was just too lazy to change that part.