EternalFeather / Transformer-in-generating-dialogue


An Implementation of "Attention Is All You Need" with a Chinese Corpus

  This code is an implementation of the paper Attention Is All You Need, applied to dialogue-generation tasks such as chatbots, text generation and so on.
  Thanks to every friend who has raised issues and helped solve them. Your contributions are very important to the improvement of this project. Because static graph mode is only partially supported in the original code, we decided to move the features to the 2.0.0-beta1 version. However, if you are worried about version problems when building the Docker image or creating a service, we still keep an old version of the code, written in eager mode with TensorFlow 1.12.x, for reference.

Documents

|-- root/
    |-- data/
        |-- src-train.csv
        |-- src-val.csv
        |-- tgt-train.csv
        `-- tgt-val.csv
    |-- old_version/
        |-- data_loader.py
        |-- eval.py
        |-- make_dic.py
        |-- modules.py
        |-- params.py
        |-- requirements.txt
        `-- train.py
    |-- tf1.12.0-eager/
        |-- bleu.py
        |-- main.ipynb
        |-- modules.py
        |-- params.py
        |-- requirements.txt
        `-- utils.py
    |-- images/
    |-- bleu.py
    |-- main-v2.ipynb
    |-- modules-v2.py
    |-- params.py
    |-- requirements.txt
    `-- utils-v2.py

Requirements

Construction

  As we all know, a translation system can be used to implement a conversational model simply by replacing the pairs of sentences in two different languages with question-and-answer pairs. After all, the basic conversational model, Sequence-to-Sequence, was developed from translation systems. So why not use the Transformer to improve the efficiency of conversational models in generating dialogues?

  With the development of BERT-based models, results on more and more NLP tasks are being refreshed constantly. However, the language-modelling task is not included among BERT's open-sourced tasks, so there is no doubt that we still have a long way to go in this direction.

Model Advantages

  A Transformer model handles variable-sized input using stacks of self-attention layers instead of RNNs or CNNs. This general architecture has a number of advantages and special tricks. Let's lay them out:

Implementation details

  In the newest version of the code, we implement the details described in the paper:

Data Generation
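
  Below is a hypothetical sketch of turning the parallel src-*.csv / tgt-*.csv files into a padded tf.data pipeline. The function name, the special tokens <s>, </s>, <unk>, and the one-sentence-per-line format are assumptions for illustration only; the repository's own preprocessing (see utils-v2.py and main-v2.ipynb) may differ.

```python
import tensorflow as tf

# Hypothetical loader: assumes one whitespace-tokenised sentence per line in the
# parallel src/tgt files and an existing token -> id dict (vocab).
def make_dataset(src_path, tgt_path, vocab, batch_size=64, max_len=50):
    def encode(line):
        ids = [vocab.get(tok, vocab['<unk>']) for tok in line.strip().split()]
        return [vocab['<s>']] + ids[:max_len - 2] + [vocab['</s>']]

    def gen():
        with open(src_path, encoding='utf-8') as fs, open(tgt_path, encoding='utf-8') as ft:
            for src_line, tgt_line in zip(fs, ft):
                yield encode(src_line), encode(tgt_line)

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_types=(tf.int32, tf.int32),
        output_shapes=([None], [None]))
    # Pad every batch to the same fixed length (see Tips) so AutoGraph does not retrace.
    return dataset.padded_batch(batch_size, padded_shapes=([max_len], [max_len]))
```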

Positional Encoding
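
  A minimal sketch of the sinusoidal positional encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and returned shape are illustrative, not necessarily identical to modules-v2.py.

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]                   # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]                     # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates                            # (max_len, d_model)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])         # even indices -> sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])         # odd indices  -> cos
    return tf.cast(angle_rads[np.newaxis, ...], tf.float32)   # (1, max_len, d_model)
```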

Mask
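
  A minimal sketch of the two masks a Transformer needs: a padding mask that blanks out padded positions, and a look-ahead mask that keeps the decoder from attending to future tokens. The pad id 0 and the function names are assumptions, not taken from this repository.

```python
import tensorflow as tf

def create_padding_mask(seq, pad_id=0):
    # 1.0 at padding positions, 0.0 elsewhere; extra axes broadcast over the
    # attention logits of shape (batch, heads, seq_q, seq_k).
    mask = tf.cast(tf.math.equal(seq, pad_id), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(size):
    # Strictly upper-triangular matrix of ones that hides future tokens.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
```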

Scaled dot product attention
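
  A minimal sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with an optional mask added to the logits before the softmax. Names and shapes are illustrative.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    matmul_qk = tf.matmul(q, k, transpose_b=True)     # (..., seq_q, seq_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(dk)             # scale by sqrt(d_k)
    if mask is not None:
        logits += (mask * -1e9)                       # masked positions -> ~ -inf
    weights = tf.nn.softmax(logits, axis=-1)          # attention weights
    return tf.matmul(weights, v), weights             # (..., seq_q, depth_v)
```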

Multi-head attention
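
  A minimal sketch of multi-head attention: project Q, K and V with learned matrices, split them into num_heads heads, attend per head, then concatenate and project back. It reuses the scaled_dot_product_attention function from the sketch above and assumes d_model is divisible by num_heads; the class name is illustrative.

```python
import tensorflow as tf

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, num_heads, seq, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        out, weights = scaled_dot_product_attention(q, k, v, mask)
        out = tf.transpose(out, perm=[0, 2, 1, 3])    # (batch, seq, heads, depth)
        out = tf.reshape(out, (batch_size, -1, self.num_heads * self.depth))
        return self.dense(out), weights
```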

Pointwise Feedforward Network
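
  A minimal sketch of the position-wise feed-forward network FFN(x) = max(0, xW1 + b1)W2 + b2, built here from two Dense layers; the convolutional variant is shown in the Comparison section below.

```python
import tensorflow as tf

def point_wise_feed_forward_network(d_model, dff):
    # Applied to each position independently: expand to dff, then project back.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),   # (batch, seq, dff)
        tf.keras.layers.Dense(d_model),                  # (batch, seq, d_model)
    ])
```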

Learning Rate Schedule
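
  A minimal sketch of the warm-up schedule from the paper, lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), paired with the Adam settings the paper uses. The class name is illustrative.

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                      # step^-0.5 after warm-up
        arg2 = step * (self.warmup_steps ** -1.5)       # linear warm-up
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(TransformerSchedule(d_model=512),
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```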

Model Downsides

However, such a strong architecture still has some downsides:

Usage

Results

Comparison

Implement the feed-forward sublayer through fully connected layers.

Implement the feed-forward sublayer through one-dimensional convolution (both variants are sketched below).
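
  For reference, the two variants differ only in which layer realises the position-wise mapping; a minimal sketch of both follows, with illustrative function names.

```python
import tensorflow as tf

def ffn_dense(d_model, dff):
    # Fully connected variant: Dense acts on the last axis, position by position.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),
        tf.keras.layers.Dense(d_model),
    ])

def ffn_conv1d(d_model, dff):
    # Conv1D with kernel size 1 computes the same position-wise mapping.
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=dff, kernel_size=1, activation='relu'),
        tf.keras.layers.Conv1D(filters=d_model, kernel_size=1),
    ])
```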

Tips

  If you try to use AutoGraph to speed up your training process, please make sure the dataset is padded to a fixed length. Otherwise the graph-rebuilding (retracing) operation will be triggered during training, which may hurt performance. Our code only guarantees the behaviour on version 2.0; lower versions can refer to it as a starting point.
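
  A minimal sketch of how a fixed input signature keeps tf.function from retracing; MAX_LEN and the empty train_step are placeholders for illustration, not copied from main-v2.ipynb.

```python
import tensorflow as tf

MAX_LEN = 50  # hypothetical fixed length; keep it equal to the dataset padding

train_step_signature = [
    tf.TensorSpec(shape=(None, MAX_LEN), dtype=tf.int32),
    tf.TensorSpec(shape=(None, MAX_LEN), dtype=tf.int32),
]

# With a fixed input signature, tf.function traces the graph once instead of
# rebuilding it for every new sequence length seen during training.
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    ...  # forward pass, loss and gradient update go here
```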

Reference

Thanks to Transformer and TensorFlow