asmekal closed this issue 6 years ago
It seems that there are two ways to solve the problem and help the model learn:
```python
...
emb = Embedding(n_labels, 10)(inp)
x = LSTM(units)(emb)
x = AttentionDecoder(10, n_labels)(x)
...
```
```python
...
emb = Embedding(n_labels, 10)(inp)
pos_emb = PositionEncoding(...)(emb)
x = concatenate([emb, pos_emb], axis=-1)
x = AttentionDecoder(10, n_labels)(x)
...
```
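The implementation of the `PositionEncoding` layer isn't shown in the snippet above; one common choice is the sinusoidal encoding from "Attention Is All You Need". A minimal numpy sketch of that encoding (the function name and dimensions here are illustrative, not from the repo):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Sinusoidal position encoding: even dims get sin, odd dims get cos,
    with wavelengths forming a geometric progression over the dimensions."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # (seq_len, d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

pe = sinusoidal_position_encoding(seq_len=5, d_model=10)
# Each row of `pe` is unique, so concatenating it with `emb`
# tags every timestep with its position.
```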
Actually these two ways are essentially the same thing: both are aimed at giving the model positional information that it otherwise lacks (at least in that toy example). Without that positional information the model is unable to fit.
So the implemented attention is able to fit, although I can't see why it is so hard for the attention to simply shift by one step each time... But anyway, the problem is solved, so I'll close the issue.
I think your intuition is correct: there needs to be some positional information, otherwise the attention decoder can make no sense of what it is receiving (just a bunch of vectors representing independent entries). It then has to look at this input sequence and figure out how to decode it, but it has no information about how each entry relates to the others. Neat experiment, thank you!
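The "bunch of independent entries" point can be made concrete: plain dot-product attention is permutation-invariant, so without positional features the read-out is identical no matter how the inputs are ordered. A toy numpy sketch (not the repo's `AttentionDecoder`, just the generic mechanism):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # plain dot-product attention over a set of vectors
    weights = softmax(keys @ query)
    return weights @ values

rng = np.random.default_rng(0)
seq = rng.normal(size=(4, 8))      # 4 entries, dim 8
query = rng.normal(size=8)
perm = rng.permutation(4)

out_orig = attend(query, seq, seq)
out_perm = attend(query, seq[perm], seq[perm])

# Identical under any reordering of the inputs: without positional
# features the decoder literally cannot tell which entry came first.
assert np.allclose(out_orig, out_perm)
```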
For a long time I've tried to adapt your model to an OCR problem. At some point I found that even with frozen encoder features, received from a CTC model (which performs well), I could not reproduce the same performance. So I made the simplest problem that any reasonable classifier should solve, a kind of autoencoder.
To no surprise, if `Dense` is used instead of `AttentionDecoder`, we receive `accuracy = 1` immediately after the first epoch. But with `AttentionDecoder` the model stalls at around `accuracy = 0.5` with no further progress at all. It seems to work well only if `t <= 2`, maybe due to `initial_state`, which is initialized from the first timestep: `s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s))`, so the attention overfits on the second timestep. But even with `t = 3`, accuracy does not exceed 0.7, which is close to guessing two labels and returning the last one at random. Any ideas?
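To spell out why that `initial_state` line is suspicious, here is the same computation in numpy (the dimensions are illustrative assumptions, not the repo's): `s0` is a function of the encoder output at `t = 0` only, so everything after the first timestep is invisible to the decoder's starting state.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=16)            # encoder output at t = 0 only
W_s = rng.normal(size=(16, 32))     # projection to the decoder state size

# numpy equivalent of s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s));
# no timestep t > 0 ever contributes to s0.
s0 = np.tanh(x0 @ W_s)
```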