QANet: https://arxiv.org/abs/1804.09541
This Keras model is based on the TensorFlow implementation of QANet (https://github.com/NLPLearn/QANet).
I find that the conv-based multi-head attention from tensor2tensor (https://github.com/NLPLearn/QANet/blob/master/layers.py) performs 3%–4% better than the matrix-multiplication-based version in (https://github.com/bojone/attention/blob/master/attention_keras.py).
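For reference, below is a minimal sketch of what such conv-based multi-head attention looks like in Keras: kernel-size-1 `Conv1D` layers play the role of the Q/K/V and output projections instead of `Dense` matmuls. This is an illustration of the idea, not the exact code from `layers.py`; the class name and defaults (8 heads, 128 filters, matching the settings reported below) are assumptions.

```python
# Illustrative sketch only; not the repository's actual layers.py implementation.
import tensorflow as tf
from tensorflow.keras import layers

class ConvMultiHeadAttention(layers.Layer):
    def __init__(self, filters=128, num_heads=8, **kwargs):
        super().__init__(**kwargs)
        assert filters % num_heads == 0
        self.filters, self.num_heads = filters, num_heads
        self.head_dim = filters // num_heads
        # kernel_size=1 convolutions act as the Q/K/V and output projections.
        self.q_conv = layers.Conv1D(filters, 1, use_bias=False)
        self.k_conv = layers.Conv1D(filters, 1, use_bias=False)
        self.v_conv = layers.Conv1D(filters, 1, use_bias=False)
        self.out_conv = layers.Conv1D(filters, 1, use_bias=False)

    def _split_heads(self, x):
        # (batch, len, filters) -> (batch, heads, len, head_dim)
        b, l = tf.shape(x)[0], tf.shape(x)[1]
        x = tf.reshape(x, (b, l, self.num_heads, self.head_dim))
        return tf.transpose(x, (0, 2, 1, 3))

    def call(self, x):
        q = self._split_heads(self.q_conv(x))
        k = self._split_heads(self.k_conv(x))
        v = self._split_heads(self.v_conv(x))
        # Scaled dot-product attention per head.
        logits = tf.matmul(q, k, transpose_b=True) / tf.sqrt(float(self.head_dim))
        attn = tf.nn.softmax(logits, axis=-1)
        out = tf.matmul(attn, v)                      # (batch, heads, len, head_dim)
        out = tf.transpose(out, (0, 2, 1, 3))
        out = tf.reshape(out, (tf.shape(out)[0], tf.shape(out)[1], self.filters))
        return self.out_conv(out)
```

It is applied to an input of shape `(batch, seq_len, d_model)` as `ConvMultiHeadAttention()(x)`.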
1. Download the SQuAD data `dev-v1.1.json` and `train-v1.1.json` from (https://rajpurkar.github.io/SQuAD-explorer/) to the folder `./original_data`.
2. Download `glove.840B.300d.txt` from (https://nlp.stanford.edu/projects/glove/) to the folder `./original_data`.
3. Run `python preprocess.py` to get the wordpiece-based preprocessed data (a sketch of wordpiece tokenization follows these steps).
4. Run `python train_QANet.py` to start training.
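For readers unfamiliar with the term, the sketch below shows the general greedy longest-match-first idea behind wordpiece tokenization. It is purely illustrative: the toy vocabulary, the `##` continuation prefix, and the function name are assumptions, and `preprocess.py` may tokenize differently.

```python
# Illustrative greedy longest-match wordpiece tokenization of a single word.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces carry a prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]                  # no piece matched: unknown token
        pieces.append(cur)
        start = end
    return pieces

# Example with a toy vocabulary: "playing" -> ["play", "##ing"]
print(wordpiece_tokenize("playing", {"play", "##ing"}))
```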
- EMA (`train_QANet.py`)
- `QAoutputBlock.py` (with about 1% improvement)
- `layer_norm.py` (about 0.5% improvement)
I find that EMA is hard to implement efficiently on GPU in Keras, and it noticeably slows training. It is also hard to add the slice op in Keras, which slows training further (about twice the time of the optimized TensorFlow version).
Now the GPU version of EMA works properly in Keras; a sketch of the underlying idea is shown below.
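For reference, the sketch below shows the basic idea of keeping an exponential moving average of the trainable weights as a Keras callback, with decay 0.9999 as in the results table. The class name and TF2-style callback hooks are assumptions, and this plain per-batch Python version is exactly the kind of implementation that is slow; the repository's GPU version works differently.

```python
# Illustrative EMA-of-weights callback; not the repository's GPU implementation.
import tensorflow as tf

class WeightEMA(tf.keras.callbacks.Callback):
    def __init__(self, decay=0.9999):
        super().__init__()
        self.decay = decay

    def on_train_begin(self, logs=None):
        # Shadow copies of all trainable weights.
        self.shadow = [tf.Variable(w, trainable=False)
                       for w in self.model.trainable_weights]

    def on_train_batch_end(self, batch, logs=None):
        # shadow = decay * shadow + (1 - decay) * current
        for s, w in zip(self.shadow, self.model.trainable_weights):
            s.assign(self.decay * s + (1.0 - self.decay) * w)

    def apply_ema_weights(self):
        # Swap shadow weights in before evaluation; keep backups to restore later.
        self.backup = [w.numpy() for w in self.model.trainable_weights]
        for w, s in zip(self.model.trainable_weights, self.shadow):
            w.assign(s)

    def restore_weights(self):
        for w, b in zip(self.model.trainable_weights, self.backup):
            w.assign(b)
```

At evaluation time the shadow weights are swapped in with `apply_ema_weights()` and the original weights restored afterwards with `restore_weights()`.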
All models use 8 attention heads and 128 filters.
Setting | Epoch | EM / F1 |
---|---|---|
batch_size=24 | 11 | 66.24% / 76.75% |
batch_size=24 + ema_decay=0.9999 | 14 | 69.51% / 79.13% |
batch_size=24 + ema_decay=0.9999 + wordpiece | 17 | 70.07% / 79.52% |
batch_size=24 + ema_decay=0.9999 + wordpiece + Cove | 13 | 71.48% / 80.85% |