
Topic-Dialog-Summ

PyTorch implementation of the AAAI-2021 paper: Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling.

Parts of the code are adapted from https://github.com/nlpyang/PreSumm.

Requirements

Environment

Data Format

Each JSON file contains a list of dialogue samples. A dialogue sample has the following format:

{"session": [
    // Utterance
    {
     // Chinese characters
     "content": ["请", "问", "有", "什", "么", "可", "以", "帮", "您"],
     // Chinese Words
     "word": ["请问", "有", "什么", "可以", "帮", "您"],
     // Role info (Agent)
     "type": "客服"
    },

    {"content": ["我", "想", "退", "货"],
     "word": ["我", "想", "退货"],
     // Role info (Customer)
     "type": "客户"}, 

    ...
 ],
 "summary": ["客", "户", "来", "电", "要", "求", "退", "货", "。", ...]
}
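
As a quick sanity check, a sample in this format can be loaded and inspected as follows (a minimal sketch; the file name json_data/train.json is an assumption for illustration):

    # Minimal sketch: load one data file and print the dialogue it contains.
    # json_data/train.json is a hypothetical file name for illustration.
    import json

    with open("json_data/train.json", encoding="utf-8") as f:
        samples = json.load(f)  # each file holds a list of dialogue samples

    sample = samples[0]
    for utt in sample["session"]:
        role = "Agent" if utt["type"] == "客服" else "Customer"
        print(role + ":", "".join(utt["word"]))
    print("Summary:", "".join(sample["summary"]))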

Usage

  1. Download BERT checkpoints.

    The pretrained BERT checkpoints can be found at:

Put the BERT checkpoints into the bert directory like this:

    --- bert
      |
      |--- chinese_bert
         |
         |--- config.json
         |
         |--- pytorch_model.bin
         |
         |--- vocab.txt
  2. Pre-train word2vec embeddings (a conceptual sketch of this step appears after this list)

    PYTHONPATH=. python ./src/train_emb.py -data_path json_data -emb_size 100 -emb_path pretrain_emb/word2vec
  3. Data Processing

    PYTHONPATH=. python ./src/preprocess.py -raw_path json_data -save_path bert_data -bert_dir bert/chinese_bert -log_file logs/preprocess.log -emb_path pretrain_emb/word2vec -tokenize -truncated -add_ex_label
  4. Pre-train the pipeline model (Ext + Abs)

    PYTHONPATH=. python ./src/train.py -data_path bert_data/ali -bert_dir bert/chinese_bert -log_file logs/pipeline.topic.train.log -sep_optim -topic_model -split_noise -pretrain -model_path models/pipeline_topic
  5. Train the whole model with RL

    PYTHONPATH=. python ./src/train.py -data_path bert_data/ali -bert_dir bert/chinese_bert -log_file logs/rl.topic.train.log -model_path models/rl_topic -topic_model -split_noise -train_from models/pipeline_topic/model_step_80000.pt -train_from_ignore_optim -lr 0.00001 -save_checkpoint_steps 500 -train_steps 30000
  6. Validate

    PYTHONPATH=. python ./src/train.py -mode validate -data_path bert_data/ali -bert_dir bert/chinese_bert -log_file logs/rl.topic.val.log -alpha 0.95 -model_path models/rl_topic -topic_model -split_noise -result_path results/val
  7. Test

    PYTHONPATH=. python ./src/train.py -mode test -data_path bert_data/ali -bert_dir bert/chinese_bert -test_from models/rl_topic/model_step_30000.pt -log_file logs/rl.topic.test.log -alpha 0.95 -topic_model -split_noise -result_path results/test
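
As the commands above suggest, checkpoints are saved under -model_path as model_step_&lt;N&gt;.pt, where &lt;N&gt; is governed by -save_checkpoint_steps and -train_steps; these are the files that -train_from and -test_from point to in steps 5 and 7.

For intuition, here is a minimal sketch of what the embedding pretraining in step 2 conceptually does, implemented with gensim rather than the repo's train_emb.py; everything except the 100-dim embedding size is an assumption:

    # Conceptual sketch of step 2 using gensim (not the repo's train_emb.py):
    # pre-train word2vec embeddings on the "word" fields of the dialogues.
    import glob
    import json

    from gensim.models import Word2Vec

    # Collect the word-segmented utterances from every JSON data file.
    sentences = []
    for path in glob.glob("json_data/*.json"):
        with open(path, encoding="utf-8") as f:
            for sample in json.load(f):
                for utterance in sample["session"]:
                    sentences.append(utterance["word"])

    # vector_size mirrors the -emb_size 100 flag; other settings are assumptions.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    model.wv.save_word2vec_format("pretrain_emb/word2vec")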

Data

Our dialogue summarization dataset was collected from the Alibaba customer service center. All dialogues are incoming calls in Mandarin Chinese between a customer and a service agent. To protect customers' private information, we desensitized the data by converting words to IDs. As a result, the data cannot be used directly with our released code or with pre-trained models such as BERT, but it still provides useful statistical information.
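
To illustrate what this desensitization means (a hypothetical reconstruction; the released data already ships in ID form):

    # Hypothetical reconstruction of the word-to-ID desensitization.
    # Each distinct word is replaced by an integer, discarding surface forms.
    def build_vocab(samples):
        """Assign an integer ID to each distinct word across all dialogues."""
        ids = {}
        for sample in samples:
            for utt in sample["session"]:
                for w in utt["word"]:
                    if w not in ids:
                        ids[w] = len(ids)
        return ids

    samples = [{"session": [{"word": ["我", "想", "退货"]}]}]
    vocab = build_vocab(samples)
    print([vocab[w] for w in ["我", "想", "退货"]])  # -> [0, 1, 2]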

The desensitized data is available at Google Drive or Baidu Pan (extract code: t6nx).

Citation

@article{Zou_Zhao_Kang_Lin_Peng_Jiang_Sun_Zhang_Huang_Liu_2021,
  title={Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling},
  author={Zou, Yicheng and Zhao, Lujun and Kang, Yangyang and Lin, Jun and Peng, Minlong and Jiang, Zhuoren and Sun, Changlong and Zhang, Qi and Huang, Xuanjing and Liu, Xiaozhong},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={35},
  number={16},
  pages={14665--14673},
  year={2021},
  month={May},
  url={https://ojs.aaai.org/index.php/AAAI/article/view/17723}
}