RowitZou / topic-dialog-summ

AAAI-2021 paper: Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling.
MIT License


PyTorch implementation of the AAAI-2021 paper: Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling.

The code is partially referred to



Data Format

Each json file is a data list that includes dialogue samples. The format of a dialogue sample is shown as follows:

{"session": [
    // Utterance
    {// Chinese characters
     "content": ["请", "问", "有", "什", "么", "可", "以", "帮", "您"],
     // Chinese words
     "word": ["请问", "有", "什么", "可以", "帮", "您"],
     // Role info (Agent)
     "type": "客服"},

    {"content": ["我", "想", "退", "货"],
     "word": ["我", "想", "退货"],
     // Role info (Customer)
     "type": "客户"},

    ...
 ],
 "summary": ["客", "户", "来", "电", "要", "求", "退", "货", "。", ...]}
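
As a rough illustration of the format above, the following sketch reads a data file and joins the character lists back into text. It is not part of the released code: the helper names `load_dialogues` and `describe` are hypothetical, and it assumes the real files are plain JSON (the `//` comments in the example are annotations only). The in-memory sample shortens the summary at the point where the example elides it.

```python
import json


def load_dialogues(path):
    """Load one data file; assumed to hold a JSON list of dialogue samples."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def describe(sample):
    """Return (role, utterance text) pairs and the summary as one string."""
    turns = [(utt["type"], "".join(utt["content"])) for utt in sample["session"]]
    summary = "".join(sample["summary"])
    return turns, summary


# In-memory sample mirroring the format description (summary truncated here).
sample = {
    "session": [
        {"content": ["请", "问", "有", "什", "么", "可", "以", "帮", "您"],
         "word": ["请问", "有", "什么", "可以", "帮", "您"],
         "type": "客服"},   # agent
        {"content": ["我", "想", "退", "货"],
         "word": ["我", "想", "退货"],
         "type": "客户"},   # customer
    ],
    "summary": ["客", "户", "来", "电", "要", "求", "退", "货", "。"],
}

turns, summary = describe(sample)
for role, text in turns:
    print(role, text)
print(summary)
```

For a file on disk, `load_dialogues` would return the list of samples, and `describe` could be applied to each element in turn.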


  1. Download BERT checkpoints.

    The pretrained BERT checkpoints can be found at:

    Put BERT checkpoints into the directory bert like this:

    --- bert
      |--- chinese_bert
         |--- config.json
         |--- pytorch_model.bin
         |--- vocab.txt
  2. Pre-train word2vec embeddings

    PYTHONPATH=. python ./src/ -data_path json_data -emb_size 100 -emb_path pretrain_emb/word2vec
  3. Data Processing

    PYTHONPATH=. python ./src/ -raw_path json_data -save_path bert_data -bert_dir bert/chinese_bert -log_file logs/preprocess.log -emb_path pretrain_emb/word2vec -tokenize -truncated -add_ex_label
  4. Pre-train the pipeline model (Ext + Abs)

    PYTHONPATH=. python ./src/ -data_path bert_data/ali -bert_dir bert/chinese_bert -log_file logs/pipeline.topic.train.log -sep_optim -topic_model -split_noise -pretrain -model_path models/pipeline_topic
  5. Train the whole model with RL

    PYTHONPATH=. python ./src/ -data_path bert_data/ali -bert_dir bert/chinese_bert -log_file logs/rl.topic.train.log -model_path models/rl_topic -topic_model -split_noise -train_from models/pipeline_topic/ -train_from_ignore_optim -lr 0.00001 -save_checkpoint_steps 500 -train_steps 30000
  6. Validate

    PYTHONPATH=. python ./src/ -mode validate -data_path bert_data/ali -bert_dir bert/chinese_bert -log_file logs/rl.topic.val.log -alpha 0.95 -model_path models/rl_topic -topic_model -split_noise -result_path results/val
  7. Test

    PYTHONPATH=. python ./src/ -mode test -data_path bert_data/ali -bert_dir bert/chinese_bert -test_from models/rl_topic/ -log_file logs/rl.topic.test.log -alpha 0.95 -topic_model -split_noise -result_path results/test
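
Every command from step 3 onward passes `-bert_dir bert/chinese_bert`, so it may be worth confirming the layout from step 1 before starting. The check below is a minimal sketch using only the standard library. The function name `missing_bert_files` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# File names taken from the directory layout shown in step 1.
REQUIRED_BERT_FILES = ("config.json", "pytorch_model.bin", "vocab.txt")


def missing_bert_files(bert_dir):
    """Return the required checkpoint files that are absent from bert_dir."""
    root = Path(bert_dir)
    return [name for name in REQUIRED_BERT_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_bert_files("bert/chinese_bert")
    if missing:
        print("Missing files in bert/chinese_bert:", ", ".join(missing))
    else:
        print("BERT checkpoint directory looks complete.")
```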


Our dialogue summarization dataset was collected from the Alibaba customer service center. All dialogues are incoming calls in Mandarin Chinese between a customer and a service agent. To protect customers' private information, we desensitized the data and converted words to IDs. As a result, the data cannot be used directly with our released code or with pre-trained models such as BERT, but the dataset still provides some statistical information.
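
To illustrate why ID-converted data still carries statistical information, here is a small sketch of one possible word-to-ID mapping. It is an assumption for illustration only: the actual desensitization procedure and the layout of the released files are not described here, and `build_vocab` and `desensitize` are hypothetical names.

```python
def build_vocab(dialogues):
    """Assign an integer ID to each distinct word, in first-seen order."""
    vocab = {}
    for dialogue in dialogues:
        for utt in dialogue["session"]:
            for word in utt["word"]:
                vocab.setdefault(word, len(vocab))
    return vocab


def desensitize(dialogue, vocab):
    """Replace every word with its ID; role labels are kept unchanged."""
    return {
        "session": [
            {"word": [vocab[w] for w in utt["word"]], "type": utt["type"]}
            for utt in dialogue["session"]
        ]
    }


dialogue = {
    "session": [
        {"word": ["请问", "有", "什么", "可以", "帮", "您"], "type": "客服"},
        {"word": ["我", "想", "退货"], "type": "客户"},
    ]
}

vocab = build_vocab([dialogue])
ids = desensitize(dialogue, vocab)

# Surface forms are gone, but counts and lengths can still be measured.
utterance_lengths = [len(utt["word"]) for utt in ids["session"]]
print(ids)
print(utterance_lengths)
```

Because the IDs no longer correspond to any tokenizer vocabulary, data in this form cannot be fed to a pre-trained model such as BERT, while quantities like utterance length or vocabulary size remain available.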

The desensitized data is available on Google Drive or Baidu Pan (extraction code: t6nx).


     title={Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling},
     journal={Proceedings of the AAAI Conference on Artificial Intelligence},
     author={Zou, Yicheng and Zhao, Lujun and Kang, Yangyang and Lin, Jun and Peng, Minlong and Jiang, Zhuoren and Sun, Changlong and Zhang, Qi and Huang, Xuanjing and Liu, Xiaozhong},