First of all, I would recommend starting training with shorter sequences and gradually increasing the time dimension of the training data (as long as shape=(None,) is used in the inputs, the model can work with data of any time dimension). This part is probably the most important.
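To illustrate just the shape=(None,) point in isolation, here is a standalone sketch with a plain LSTM (nothing from this repository; layer sizes and data are arbitrary) showing that one compiled model can be fitted on short and then on long sequences:
# standalone sketch: with shape=(None,) the same compiled model accepts any sequence length
import numpy as np
from keras.layers import Input, Embedding, LSTM, TimeDistributed, Dense
from keras.models import Model

inp = Input(shape=(None,), dtype='int64')                 # time dimension left unspecified
h = Embedding(100, 32)(inp)
h = LSTM(32, return_sequences=True)(h)
out = TimeDistributed(Dense(100, activation='softmax'))(h)
toy = Model(inp, out)
toy.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

x_short = np.random.randint(0, 100, size=(64, 20))        # t = 20
x_long = np.random.randint(0, 100, size=(64, 128))        # t = 128
toy.fit(x_short, np.expand_dims(x_short, -1), epochs=1)   # train on short sequences first...
toy.fit(x_long, np.expand_dims(x_long, -1), epochs=1)     # ...then continue on longer ones
The full configuration below does the same thing with the actual AttentionDecoder model.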
Next, for that particular toy example teacher forcing and the output embedding size are unimportant and may even slow down convergence, because the labels were generated randomly and the next output has no connection with the previous one.
For PositionEmbedding, the equation for the 2i-th wave is encodings[2 * i] = np.sin(positions / 10. ** (2. * i / d_model)). My intuition is that the last wave should have a period max_time/5 <= T <= max_time (max_time is 128 in your example), so there should be a few entire periods over the sequence: at least one, but not too many.
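A quick standalone NumPy check of that intuition (my own reconstruction from the sine formula quoted above, not the layer's actual code; the cosine waves at odd indices are an assumption, and seq_len=128, n_waves=50, d_model=128 match the example and the configuration below):
import numpy as np

d_model, n_waves, seq_len = 128, 50, 128
positions = np.arange(seq_len, dtype=np.float64)
encodings = np.zeros((2 * n_waves, seq_len))
for i in range(n_waves):
    wavelength = 10. ** (2. * i / d_model)                 # grows with i, so later waves oscillate more slowly
    encodings[2 * i] = np.sin(positions / wavelength)
    encodings[2 * i + 1] = np.cos(positions / wavelength)  # cosine at odd indices is my assumption

# period of the slowest (last) sine wave: 2*pi * 10**(2*(n_waves-1)/d_model) ~= 37 steps,
# i.e. roughly 3-4 full periods over a 128-step sequence, inside the max_time/5 <= T <= max_time range
print(2 * np.pi * 10. ** (2. * (n_waves - 1) / d_model))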
I also decreased the units of AttentionDecoder for faster convergence.
With all of that said, I used the following configuration and achieved 78% accuracy on long sequences before the gradients exploded. I think that after adding gradient clipping or changing the learning rate the problem should be solved completely (a clipping sketch follows the code below).
# imports (NumPy and Keras; AttentionDecoder and PositionEmbedding are the custom layers from this repo)
import numpy as np
from keras.layers import Input, Embedding, concatenate
from keras.models import Model

# generating data
n, t = 100000, 20 #start from the smaller sequences
n_labels = 100 #number of frequently used chars
x = np.random.randint(0, n_labels, size=(n, t))
y = np.expand_dims(x, axis=-1)
x_val = np.random.randint(0, n_labels, size=(n // 100, t))
y_val = np.expand_dims(x_val, axis=-1)
# building model
inputs = Input(shape=(None,), dtype='int64')
outp_true = Input(shape=(None,), dtype='int64')
embedded = Embedding(n_labels, n_labels, weights=[np.eye(n_labels)], trainable=False)(inputs)
pos_emb = PositionEmbedding(max_time=1000, n_waves=50, d_model=128)(embedded)
nnet = concatenate([embedded, pos_emb], axis=-1)
attention_decoder = AttentionDecoder(50, n_labels,
                                     embedding_dim=16,
                                     is_monotonic=False,
                                     normalize_energy=False)
## (alternative) without teacher forcing
output = attention_decoder(nnet)
model = Model(inputs=[inputs, outp_true], outputs=[output])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adadelta',
    metrics=['accuracy'])
model.summary()
model.fit([x, np.squeeze(y, axis=-1)], y,
          epochs=2,
          validation_data=([x_val, np.squeeze(y_val, axis=-1)], y_val))
# setting desired sequence length
n, t = 100000, 128 # 128 chars in sequence
x = np.random.randint(0, n_labels, size=(n, t))
y = np.expand_dims(x, axis=-1)
x_val = np.random.randint(0, n_labels, size=(n // 100, t))
y_val = np.expand_dims(x_val, axis=-1)
model.fit([x, np.squeeze(y, axis=-1)], y,
          epochs=5,
          validation_data=([x_val, np.squeeze(y_val, axis=-1)], y_val))
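For the gradient clipping mentioned above, a minimal sketch: recompile with an Adadelta instance instead of the string optimizer (clipnorm=1.0 is an untuned placeholder value):
from keras.optimizers import Adadelta

# same loss/metrics as before, but clip gradient norms to limit the explosions
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adadelta(clipnorm=1.0),
    metrics=['accuracy'])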
Thanks. I will try. I agree. Initially it was stuck at 1%, but later it reached up to 50% with 10 epochs. Will try 25 epochs sometime and update.
Noting some points for the benefit of others:
OK. If we decided not to use teacher forcing, then can we change the model as follows?
## (alternative) without teacher forcing
output = attention_decoder(nnet)
model = Model(inputs=inputs, outputs=output)
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adadelta',
    metrics=['accuracy'])
model.summary()
model.fit(x, y,
          epochs=5,
          validation_data=(x_val, y_val))
Certainly we can (and we should). I was just too lazy to change that part.
What would be the recommended config for the following scenario: t=128 and n_labels=100? I tried, and accuracy is stuck at 1%, both with and without teacher forcing.