LIN-SHANG / InstructERC

The official implementation of InstructERC
chatglm-6b chatglm2-6b emotion-recognition-in-conversation large-language-models llama-7b llama2-7b supervised-finetuning unified-data-processing

InstructionERC

🎥 Overview

This repository contains the open-sourced official implementation of our work InstructERC:

InstructERC: Reforming Emotion Recognition in Conversation with a Retrieval Multi-task LLMs Framework

If you find this repo helpful, please cite the following paper:

@article{lei2023instructerc,
  title={Instructerc: Reforming emotion recognition in conversation with a retrieval multi-task llms framework},
  author={Lei, Shanglin and Dong, Guanting and Wang, Xiaoping and Wang, Keheng and Wang, Sirui},
  journal={arXiv preprint arXiv:2309.11911},
  year={2023}
}

Introduction

In this study, we propose a novel approach, InstructERC, which reformulates the ERC task from a discriminative framework to a generative framework based on LLMs. InstructERC makes two significant contributions. First, it introduces a simple yet effective retrieval template module, which helps the model explicitly integrate multi-granularity dialogue supervision information by concatenating the historical dialogue content, the label statement, and emotional domain demonstrations with high semantic similarity. Second, we introduce two additional emotion alignment tasks, namely speaker identification and emotion prediction, to implicitly model the dialogue role relationships and future emotional tendencies in conversations. Our LLM-based plug-and-play framework significantly outperforms all previous models and achieves comprehensive SOTA on three commonly used ERC datasets. Extensive analysis of parameter-efficient and data-scaling experiments provides empirical guidance for applying InstructERC in practical scenarios. Our code will be released after blind review.
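The retrieval template idea described above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the prompt wording, and the toy two-dimensional embeddings are all assumptions made for illustration. In practice the embeddings would come from a sentence-embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_demonstration(query_vec, pool):
    """Return the (utterance, label) pair most similar to the query embedding.

    `pool` is a hypothetical list of dicts with keys
    "utterance", "label" and "embedding".
    """
    best = max(pool, key=lambda item: cosine(query_vec, item["embedding"]))
    return best["utterance"], best["label"]

def build_prompt(history, target, labels, demonstration):
    """Concatenate dialogue history, a demonstration and the label statement.

    The section markers and wording are illustrative, not the paper's template.
    """
    demo_utt, demo_label = demonstration
    lines = [
        "### Dialogue history:",
        *[f"{speaker}: {text}" for speaker, text in history],
        f"### Demonstration: \"{demo_utt}\" -> {demo_label}",
        f"### Select the emotion of \"{target}\" from: {', '.join(labels)}",
    ]
    return "\n".join(lines)
```

A call such as `build_prompt(history, target, labels, retrieve_demonstration(query_vec, pool))` then yields one text prompt per target utterance, which is the input a generative model would be fine-tuned on.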

🍯 Overall Framework

image

🎯 Quick Start

This repo consists of the following files:

.
├── checkpoint
├── code
│   ├── data_process_mixed.py
│   ├── data_process_plain.py
│   ├── data_process.py
│   ├── data_utils
│   ├── main_new.py
│   ├── train_and_inference_Mixed.sh
│   ├── train_and_inference_Plain.sh
│   └── train_and_inference_Uni.sh
├── data
│   ├── EmoryNLP
│   ├── iemocap
│   └── meld
├── demo
│   └── demo.ipynb
├── envs
│   └── requirements.txt
├── experiments
├── file_structure.txt
├── LLM_bases
│   ├── Bloom-560m
│   ├── ChatGLM
│   ├── ChatGLM2
│   ├── LLaMA
│   └── LLaMA2
├── original_data
│   ├── dailydialog
│   ├── EmoryNLP
│   ├── iemocap
│   ├── meld
│   └── peek_of_dataset.ipynb
└── README.md

As shown in the tree-like structure above, InstructERC consists of the following folders: code, data, demo, envs, LLM_bases, and original_data.

Dependencies

We suggest creating a Docker environment for InstructERC so that your existing system, libraries, and files are not affected. Make sure your devtoolset-8-toolchain version matches ours:

yum install devtoolset-8-toolchain
yum install gcc 9.3.1

General Setup Environment:

InstructERC Setup Environment:

cd ./InstructERC/envs/
pip3 install -r requirements.txt

LLMs download

Follow this link.

ONLY Validate Our Work

Due to Meituan's code review process, the public release date of the model parameters is unpredictable. However, in our tests on other machines, the results stay within ±0.5 of the numbers presented in the paper. For the implementation of the demonstration, you can refer to the following link: https://github.com/UKPLab/sentence-transformers

Completely repeat all our work

cd ./InstructERC

To reproduce the results, we have three pipelines available.

bash train_and_inference_Uni.sh

The shell parameter that controls the main process: Flag

Its value is 0 or 1.
The main process is interrupted when Flag is 0.

The hyperparameters you need to set:

1.MODEL_NAME (selections: ChatGLM, ChatGLM2, LLaMA, LLaMA2)
# MODEL_NAME determines on which model base InstructERC will be fine-tuned.

2.Experiments_setting (selections: LoRA, All-parameters)
# Experiments_setting determines whether fine-tuning updates all parameters or is parameter-efficient (LoRA).

3.dataset (selections: IEMOCAP, MELD, EmoryNLP)
# The specific dataset you want InstructERC to finetune on.

4.accumulations (type:int)
# The number of gradient accumulation steps. Due to GPU limitations, we fine-tune with gradient accumulation.

5.graphics_card (type:int)
# graphics_card is the number of graphics cards you use when fine-tuning.

Notes: batch size = graphics_card * accumulations
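The note above can be checked with a small helper. This is only a sketch: the function name is hypothetical, and it assumes one sample per graphics card per step, which is what the formula implies.

```python
def effective_batch_size(graphics_card: int, accumulations: int) -> int:
    """Effective batch size per optimizer step.

    Assumes each graphics card processes one sample per forward pass,
    so the batch size is graphics_card * accumulations.
    """
    if graphics_card < 1 or accumulations < 1:
        raise ValueError("graphics_card and accumulations must be positive integers")
    return graphics_card * accumulations

# Example: 4 graphics cards with 8 accumulation steps.
print(effective_batch_size(4, 8))
```

Under this assumption, halving the number of graphics cards while doubling accumulations keeps the batch size unchanged.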

The remaining subprocesses determined by these hyperparameters are designed to conduct different experiments.


bash train_and_inference_Mixed.sh

Compared to train_and_inference_Uni.sh, you can ignore the dataset hyperparameter, because the unified dataset already includes all ERC datasets.


bash train_and_inference_Plain.sh

📋 Result:

Main Result

Table 1: The main results on three benchmarks

| Dataset | IEMOCAP | MELD | EmoryNLP | Average | Type |
|:----------------:|---------|-------|----------|---------|:-----:|
| Models | W-F1 | W-F1 | W-F1 | W-F1 | |
| **Discriminative Models** | | | | | |
| EmotionIC | 69.50 | 66.40 | **40.01** | **58.63** | Attention |
| SACL | 69.22 | **66.45** | 39.65 | 58.44 | Recurrent |
| SKAIG | 66.98 | 65.18 | 38.88 | 57.01 | Knowledge |
| GraphCFC | 68.91 | 58.86 | - | - | Graph |
| UniMSE | **70.66** | 65.51 | - | - | Multimodal |
| **Zero-shot + InstructERC** | | | | | |
| ChatGLM | **38.6** | **38.8** | 19.6 | **32.33** | LLM |
| ChatGLM2 | 21.1 | 21.8 | **24.4** | 22.43 | LLM |
| Llama | 0.753 | 9.12 | 5.31 | 5.06 | LLM |
| Llama2 | 2.774 | 16.28 | 8.36 | 9.46 | LLM |
| **LoRA + InstructERC** | | | | | |
| ChatGLM | 36.04 | 46.41 | 30.86 | 37.77 | LLM |
| ChatGLM2 | 67.54 | 65.58 | 39.09 | 57.40 | LLM |
| Llama | 64.17 | 67.62 | 39.34 | 57.04 | LLM |
| Llama2 | **71.39** | **69.15** | **41.37** | **60.64** | LLM |
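For readers unfamiliar with the metric, the sketch below shows how a W-F1 score can be computed. It assumes W-F1 denotes the usual support-weighted F1 (per-class F1 averaged with weights proportional to the number of true instances of each class), reported as a percentage. The function name is illustrative and is not taken from the repository.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1 in percent (assumed definition of W-F1)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total  # weight each class by its support
    return 100 * score

# Toy example with three classes.
y_true = ["joyful", "joyful", "sad", "neutral"]
y_pred = ["joyful", "sad", "sad", "neutral"]
print(round(weighted_f1(y_true, y_pred), 2))
```

The Average column is consistent with the arithmetic mean of the three per-dataset W-F1 scores in each row.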

All Parameters vs Parameter Efficiency

In order to investigate the effect of different parameter fine-tuning methods on the ERC task, we conducted comparative experiments in Table 2.

Table 2: The comparison results of different parameter fine-tuning settings on three benchmarks.

| Dataset | IEMOCAP | MELD | EmoryNLP | Average |
|:-----------:|---------|--------|----------|---------|
| Models | W-F1 | W-F1 | W-F1 | W-F1 |
| **All parameters + InstructERC** | | | | |
| ChatGLM | 33.94 | 37.96 | 13.25 | 28.38 |
| ChatGLM2 | 70.05 | 63.24 | 38.77 | 57.35 |
| Llama | 69.38 | **66.01** | **40.21** | **58.53** |
| Llama2 | **70.30** | 64.80 | 40.05 | 58.38 |
| **LoRA + InstructERC** | | | | |
| ChatGLM | 36.04 | 46.41 | 30.86 | 37.77 |
| ChatGLM2 | 67.54 | 65.58 | 39.09 | 57.40 |
| Llama | 69.71 | 68.89 | 39.90 | 59.50 |
| Llama2 | **71.39** | **69.15** | **41.37** | **60.64** |

A.1 Unified dataset labeling

We continue to use the previous datasets IEMOCAP, MELD, and EmoryNLP. Following The Feeling Wheel [^1] proposed in 1982, shown in Figure 2, we align all emotional labels of the three datasets under this standard; the details are shown in Table 3. After the label mapping is complete, there are 9 emotional labels in total: joyful, sad, neutral, mad, excited, powerful, fear, peaceful, and disgust.

Figure 2: The Feeling Wheel[^1]

image

Table 3: Unified Label Mapping

| Number | IEMOCAP | MELD | EmoryNLP | Final Emotion |
| :------: | :-------: | :----: | :--------: | :-------------: |
| 1 | happy | joyful | joyful | joyful |
| 2 | sad | sad | sad | sad |
| 3 | neutral | neutral | neutral | neutral |
| 4 | angry | angry | mad | mad |
| 5 | excited | N/A | N/A | excited |
| 6 | N/A | surprise | powerful | powerful |
| 7 | scared | fear | frustrated | fear |
| 8 | N/A | N/A | peaceful | peaceful |
| 9 | N/A | disgust | N/A | disgust |
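The mapping in Table 3 can be expressed as a lookup table. The sketch below transcribes the table exactly as printed; the dictionary layout, the dataset keys, and the function name `unify_label` are assumptions for illustration and may differ from what the data processing scripts actually use.

```python
# Unified label mapping, transcribed from Table 3 as printed.
# N/A entries are omitted: a dataset simply has no source label for that row.
UNIFIED_LABELS = {
    "IEMOCAP": {
        "happy": "joyful", "sad": "sad", "neutral": "neutral",
        "angry": "mad", "excited": "excited", "scared": "fear",
    },
    "MELD": {
        "joyful": "joyful", "sad": "sad", "neutral": "neutral",
        "angry": "mad", "surprise": "powerful", "fear": "fear",
        "disgust": "disgust",
    },
    "EmoryNLP": {
        "joyful": "joyful", "sad": "sad", "neutral": "neutral",
        "mad": "mad", "powerful": "powerful", "frustrated": "fear",
        "peaceful": "peaceful",
    },
}

def unify_label(dataset: str, label: str) -> str:
    """Map a dataset-specific emotion label to its unified label."""
    try:
        return UNIFIED_LABELS[dataset][label]
    except KeyError:
        raise ValueError(f"no unified label for {label!r} in {dataset!r}") from None

# All datasets together cover the 9 unified emotions.
all_unified = {u for mapping in UNIFIED_LABELS.values() for u in mapping.values()}
print(sorted(all_unified))
```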

A.2 Unified dataset Experiment

We again use the LoRA method in PEFT to train InstructERC on the unified dataset, and the trained model is evaluated on each of the three datasets separately. We also design total mix and ratio mix experiments to explore the impact of different data mixing strategies and data quantities on the model. On this basis, we further explore the impact of the data sampling ratio on the model's performance. The details are shown in Table 5, and a more intuitive presentation is shown in Figure 6.

| Data Percent | IEMOCAP W-F1 (Total Mix) | IEMOCAP W-F1 (Ratio Mix) | IEMOCAP W-F1 (Single) | MELD W-F1 (Total Mix) | MELD W-F1 (Ratio Mix) | MELD W-F1 (Single) | EmoryNLP W-F1 (Total Mix) | EmoryNLP W-F1 (Ratio Mix) | EmoryNLP W-F1 (Single) |
| :------------: | ----------------------- | ----------------------- | --------------------- | --------------------- | --------------------- | ------------------ | ------------------------ | ------------------------ | ---------------------- |
| 1 | 68.99 | 68.99 | **71.39** | 68.07 | 68.07 | **69.15** | 40.27 | 40.27 | **41.37** |
| 1/2 | 67.95 | 68.96 | **69.13** | 66.50 | 66.42 | **67.54** | 39.18 | 39.33 | **39.65** |
| 1/4 | 63.02 | 64.46 | **67.54** | 66.41 | 65.85 | **66.42** | 38.26 | 37.29 | **38.33** |
| 1/8 | 58.48 | 60.06 | **64.13** | 64.57 | 62.94 | **65.14** | 38.27 | **39.24** | 38.24 |
| 1/16 | 57.77 | 53.40 | **60.42** | 61.15 | 58.42 | **62.89** | 37.19 | **37.60** | 36.83 |
| 1/32 | 45.89 | 48.50 | **54.76** | 57.38 | **57.76** | 57.72 | **37.09** | 36.09 | 34.03 |
| 1/64 | 38.42 | **43.07** | 30.34 | **54.26** | 53.29 | 45.48 | **35.19** | 34.65 | 26.10 |

[^1]: Willcox, G. (1982). The Feeling Wheel: A Tool for Expanding Awareness of Emotions and Increasing Spontaneity and Intimacy. Transactional Analysis Journal, 12(4), 274-276.