farizrahman4u / seq2seq

Sequence to Sequence Learning with Keras
GNU General Public License v2.0

A Strange phenomenon when training AttentionSeq2Seq model #226

Open rtygbwwwerr opened 6 years ago

rtygbwwwerr commented 6 years ago

I'm using AttentionSeq2Seq for a token normalization task. The input is a batch of one-hot vectors, the target is a distribution over the output vocabulary, and the batch size is 128.

```python
model = AttentionSeq2Seq(input_dim=197, input_length=80, hidden_dim=197,
                         output_length=111, output_dim=197, dropout=0.0, depth=1)
```

```python
model.compile(loss='categorical_crossentropy', optimizer='Adagrad', metrics=['accuracy'])
```

I have tried different optimizers (Adam, Adadelta) and different batch sizes (64, 128, 256), but the odd phenomenon is always there: while training, the loss is increasing while the accuracy is increasing as well.

Epoch 1/100

```
 1/313 [.........] - ETA: 610s - loss: 6.2404 - acc: 0.0000e+00
 2/313 [.........] - ETA: 473s - loss: 11.0287 - acc: 0.4753
 3/313 [.........] - ETA: 430s - loss: 12.6185 - acc: 0.6345
 4/313 [.........] - ETA: 404s - loss: 13.4089 - acc: 0.7140
 5/313 [.........] - ETA: 389s - loss: 13.8778 - acc: 0.7611
 6/313 [.........] - ETA: 384s - loss: 14.1902 - acc: 0.7929
 7/313 [.........] - ETA: 376s - loss: 14.4087 - acc: 0.8152
 8/313 [.........] - ETA: 374s - loss: 13.0383 - acc: 0.8321
 9/313 [.........] - ETA: 369s - loss: 13.3385 - acc: 0.8452
10/313 [.........] - ETA: 363s - loss: 13.5804 - acc: 0.8558
11/313 [>........] - ETA: 358s - loss: 13.7774 - acc: 0.8644
12/313 [>........] - ETA: 355s - loss: 13.9411 - acc: 0.8715
13/313 [>........] - ETA: 352s - loss: 14.0796 - acc: 0.8775
```
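For what it's worth, loss and accuracy rising together is mathematically possible, because accuracy only looks at the argmax of each predicted distribution, while categorical crossentropy scores the whole distribution. A toy numpy sketch with made-up predictions (nothing here comes from the model above) shows the two metrics moving in the same direction:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    # Mean over samples of -sum(y_true * log(y_pred)); y_true is one-hot.
    return float(-np.mean(np.sum(y_true * np.log(y_pred + 1e-7), axis=-1)))

def accuracy(y_true, y_pred):
    # Accuracy compares argmaxes only -- it ignores prediction confidence.
    return float(np.mean(np.argmax(y_true, -1) == np.argmax(y_pred, -1)))

y_true = np.eye(3)[[0, 1, 2, 0]]  # four one-hot targets

# Round A: half the samples correct, all predictions moderately confident.
pred_a = np.array([[0.5, 0.3, 0.2],
                   [0.3, 0.5, 0.2],
                   [0.4, 0.3, 0.3],   # wrong (true class is 2)
                   [0.3, 0.4, 0.3]])  # wrong (true class is 0)

# Round B: three samples now confidently correct, but one sample is
# confidently wrong -- its large -log(0.001) term dominates the mean loss.
pred_b = np.array([[0.90, 0.05, 0.05],
                   [0.05, 0.90, 0.05],
                   [0.05, 0.05, 0.90],
                   [0.001, 0.997, 0.002]])  # wrong (true class is 0)

loss_a, acc_a = categorical_crossentropy(y_true, pred_a), accuracy(y_true, pred_a)
loss_b, acc_b = categorical_crossentropy(y_true, pred_b), accuracy(y_true, pred_b)
print(f"A: loss={loss_a:.3f} acc={acc_a:.2f}")  # lower loss, lower accuracy
print(f"B: loss={loss_b:.3f} acc={acc_b:.2f}")  # higher loss, higher accuracy
```

So a rising loss alongside rising accuracy means the model is getting more samples right by argmax while becoming badly overconfident on the ones it gets wrong.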

Unfortunately, after a few seconds the accuracy started decreasing and never improved again:

```
12/313 [>........] - ETA: 355s - loss: 15.3511 - acc: 0.7715
13/313 [>........] - ETA: 352s - loss: 15.4796 - acc: 0.7705
```
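Part of what the log shows may just be display behaviour: as far as I know, with `verbose=1` Keras prints the running mean of each metric over the batches seen so far in the epoch, so a sudden collapse in per-batch accuracy only drags the displayed number down gradually. A small illustration with invented per-batch accuracies:

```python
import numpy as np

# Hypothetical per-batch accuracies: strong early batches, then a collapse
# from batch 6 onward.
batch_acc = np.array([0.0, 0.95, 0.95, 0.95, 0.95, 0.2, 0.2, 0.2, 0.2, 0.2])

# What the progress bar would display after each batch: the running mean
# of all batches so far, not the latest batch's value.
displayed = np.cumsum(batch_acc) / np.arange(1, len(batch_acc) + 1)
print(np.round(displayed, 3))
```

The displayed accuracy drifts down slowly even though the per-batch accuracy fell off a cliff, which matches the gentle 0.87 → 0.77 decline in the log.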

Even stranger: on some runs the accuracy stayed very low, without any change to the dataset or the model parameters:

Epoch 1/100

```
 1/313 [.........] - ETA: 336s - loss: 6.7731 - acc: 0.0000e+00
 2/313 [.........] - ETA: 261s - loss: 11.2224 - acc: 0.0000e+00
 3/313 [.........] - ETA: 237s - loss: 12.6927 - acc: 0.0030
 4/313 [.........] - ETA: 225s - loss: 13.4339 - acc: 0.0045
 5/313 [.........] - ETA: 217s - loss: 13.8691 - acc: 0.0054
 6/313 [.........] - ETA: 212s - loss: 14.1624 - acc: 0.0072
 7/313 [.........] - ETA: 209s - loss: 14.3652 - acc: 0.0088
 8/313 [.........] - ETA: 208s - loss: 13.0320 - acc: 0.0099
 9/313 [.........] - ETA: 205s - loss: 13.3196 - acc: 0.0108
10/313 [.........] - ETA: 202s - loss: 13.5544 - acc: 0.0116
11/313 [.........] - ETA: 200s - loss: 13.7453 - acc: 0.0121
12/313 [.........] - ETA: 199s - loss: 13.9046 - acc: 0.0126
13/313 [.........] - ETA: 198s - loss: 14.0383 - acc: 0.0130
```
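The gap between the two runs (accuracy climbing to ~0.88 in one, stuck near 0.01 in the other) with identical data and hyperparameters points at run-to-run randomness, most likely the random weight initialization. One way to check is to fix the seeds (`np.random.seed`, plus the backend's own seed) before building the model. The sketch below only illustrates the seed/initialization relationship with a stand-in initializer; it is not the library's API:

```python
import numpy as np

def init_weights(seed, shape=(4, 4)):
    # Stand-in for a layer's random weight initializer -- purely
    # illustrative, not how seq2seq/Keras initializes weights.
    rng = np.random.RandomState(seed)
    return rng.uniform(-0.05, 0.05, size=shape)

w_run1 = init_weights(seed=0)
w_run2 = init_weights(seed=0)  # same seed -> identical starting weights
w_run3 = init_weights(seed=1)  # different seed -> a different starting point

print(np.allclose(w_run1, w_run2))  # True: seeded runs start identically
print(np.allclose(w_run1, w_run3))  # False: unseeded runs start differently
```

If seeded runs still diverge this much, the problem is elsewhere; if they become reproducible, the model is simply very sensitive to its starting point.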