This repository contains the source code for the following paper:
A Comprehensive Assessment of Dialog Evaluation Metrics
We use conda to manage the environments for different metrics.
Each directory in conda_envs holds an environment specification. Please install all of them before starting the next step.
For example, to install conda_envs/eval_base, run:
conda env create -f conda_envs/eval_base/environment.yml
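If you want to test an environment manually, you can activate it as in the sketch below. The environment name is an assumption based on the directory name; check the name: field of the corresponding environment.yml if it differs.

```bash
# Assumed environment name; see the "name:" field of conda_envs/eval_base/environment.yml
conda activate eval_base
```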
Note that some packages cannot be installed this way.
If you find that any packages are missing, such as bleurt, nlg-eval, or the models downloaded by spaCy, please install them by following their official instructions.
We apologize for any inconvenience.
Each quality-annotated dataset has its own directory under data, together with a data_loader.py for parsing the data.
Please follow the instructions below to download each dataset, place it in the corresponding directory, and run data_loader.py directly to verify that you are using the correct data.
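For example, after completing the downloads below, you can sanity-check one dataset directory as sketched here (each quality-annotated data directory ships its own data_loader.py):

```bash
# If the data is in the right place, the loader should parse it without errors
# and produce the corresponding human scores.
cd data/dstc9_data
python data_loader.py
```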
- Download human_rating_scores.txt from https://www.dropbox.com/s/oh1trbos0tjzn7t/dstc6_t2_evaluation.tgz .
- Download and place the data directory https://github.com/ictnlp/DialoFlow/tree/main/FlowScore/data into data/dstc9_data .
- Download https://github.com/PlusLabNLP/PredictiveEngagement/blob/master/data/Eng_Scores_queries_gen_gtruth_replies.csv and rename it to engage_all.csv .
- Download http://shikib.com/fed_data.json .
- Download and place each directory in https://github.com/li3cmz/GRADE/tree/main/evaluation/eval_data as data/grade_data/[convai2|dailydialog|empatheticdialogues] .
- Also download the human_score.txt files from https://github.com/li3cmz/GRADE/tree/main/evaluation/human_score into the corresponding data/grade_data/[convai2|dailydialog|empatheticdialogues] directories.
- Download context_data_release.csv and fluency_data_release.csv from https://github.com/alexzhou907/dialogue_evaluation .
- Download the TopicalChat and PersonaChat data from http://shikib.com/usr .
For the baselines, we use nlg-eval. Please follow its instructions to install it.
For each dialog metric, please follow the instructions in the README of the corresponding directory.
PredictiveEngage, BERT-RUBER, and PONE require a running bert-as-service.
If you want to evaluate them, please install and run bert-as-service following the instructions here.
We also provide run_bert_as_service.sh, the script we used to run bert-as-service; feel free to use it.
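Using run_bert_as_service.sh is likely the most convenient option, but an equivalent manual invocation looks roughly like the sketch below; the checkpoint path and worker count are assumptions, so adjust them to your setup.

```bash
# Assumes an English uncased BERT checkpoint has been downloaded and unpacked.
# -model_dir points to the checkpoint directory, -num_worker to the number of serving workers.
bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 -num_worker=2
```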
We used a web server to run USR and FED in our experiments.
Please modify the paths in usr_fed/usr/usr_server.py and usr_fed/fed/fed_server.py to start the servers, and modify the path in usr_fed_metric.py accordingly.
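A rough sketch of launching the two servers is shown below; it assumes both scripts can be started directly with python once the paths are modified (check each script for the ports and model paths it expects):

```bash
# Start the USR and FED servers in the background (assumed invocation).
python usr_fed/usr/usr_server.py &
python usr_fed/fed/fed_server.py &
```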
After you download all the datasets, run gen_data.py to transform them into the input format required by each metric. If you only want to evaluate a single metric METRIC on a single dataset DATASET, run
python gen_data.py --source_data DATASET --target_format METRIC
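For example, to generate only the GRADE-formatted inputs for the DSTC9 data (the identifiers grade and dstc9_data are assumptions inferred from the output paths shown later; check gen_data.py for the exact names):

```bash
python gen_data.py --source_data dstc9_data --target_format grade
```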
Modify the path in run_eval.sh as specified in the script, since we need to activate the Conda environments when running it. Then run eval_metrics.sh to evaluate all quality-annotated data.
Some metrics generate their output in their own special formats. Therefore, run read_result.py to read the results of those metrics and transform them into the outputs directory.
As with gen_data.py, you can specify the metric and dataset with
python read_result.py --metric METRIC --eval_data DATASET
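For example (again assuming grade and dstc9_data are the identifiers used by the scripts):

```bash
python read_result.py --metric grade --eval_data dstc9_data
```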
outputs/METRIC/DATA/results.json holds the prediction scores of each metric (METRIC) on each quality-annotated dataset (DATA), while running data_loader.py directly in each data directory also generates the corresponding human scores. You can perform any analysis with these data (the Jupyter notebook used in our analysis will be released).
For example, outputs/grade/dstc9_data/results.json could look like:
```python
{
    'GRADE':          # the metric name
    [
        0.2568123,    # the score of the first sample
        0.1552132,
        ...
        0.7812346
    ]
}
```
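As a sketch of the kind of analysis you can run on these outputs, the snippet below computes turn-level Pearson and Spearman correlations between a metric's predictions and the human scores. The 'GRADE' key and the human-score file path are placeholders for illustration; use whatever key the metric writes into results.json and however you exported the human scores from the dataset's data_loader.py.

```python
import json

from scipy.stats import pearsonr, spearmanr

# Metric predictions produced by the evaluation scripts (placeholder path and key).
with open('outputs/grade/dstc9_data/results.json') as f:
    metric_scores = json.load(f)['GRADE']

# Human scores for the same samples, one float per line, in the same order
# (placeholder path; export them from the dataset's data_loader.py).
with open('human_scores.txt') as f:
    human_scores = [float(line) for line in f]

pearson_r, pearson_p = pearsonr(metric_scores, human_scores)
spearman_r, spearman_p = spearmanr(metric_scores, human_scores)
print(f'Pearson:  {pearson_r:.3f} (p = {pearson_p:.3g})')
print(f'Spearman: {spearman_r:.3f} (p = {spearman_p:.3g})')
```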
The tables below report the correlations between each metric and human judgments on each quality-annotated dataset. P denotes the Pearson correlation and S the Spearman correlation, reported at the turn, dialog, or system level as indicated. All values are statistically significant at p < 0.05 unless marked with *.
USR-TopicalChat and USR-PersonaChat

| Metric | TopicalChat Turn P | TopicalChat Turn S | TopicalChat System P | TopicalChat System S | PersonaChat Turn P | PersonaChat Turn S | PersonaChat System P | PersonaChat System S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLEU-4 | 0.216 | 0.296 | 0.874* | 0.900 | 0.135 | 0.090* | 0.841* | 0.800* |
| METEOR | 0.336 | 0.391 | 0.943 | 0.900 | 0.253 | 0.271 | 0.907* | 0.800* |
| ROUGE-L | 0.275 | 0.287 | 0.814* | 0.900 | 0.066* | 0.038* | 0.171* | 0.000* |
| ADEM | -0.060* | -0.061* | 0.202* | 0.700* | -0.141 | -0.085* | 0.523* | 0.400* |
| BERTScore | 0.298 | 0.325 | 0.854* | 0.900 | 0.152 | 0.122* | 0.241* | 0.000* |
| BLEURT | 0.216 | 0.261 | 0.630* | 0.900 | 0.065* | 0.054* | -0.125* | 0.000* |
| QuestEval | 0.300 | 0.338 | 0.943 | 1.000 | 0.176 | 0.236 | 0.885* | 1.000 |
| RUBER | 0.247 | 0.259 | 0.876* | 1.000 | 0.131 | 0.190 | 0.997 | 1.000 |
| BERT-RUBER | 0.342 | 0.348 | 0.992 | 0.900 | 0.266 | 0.248 | 0.958 | 0.200* |
| PONE | 0.271 | 0.274 | 0.893 | 0.500* | 0.373 | 0.375 | 0.979 | 0.800* |
| MAUDE | 0.044* | 0.083* | 0.317* | -0.200* | 0.345 | 0.298 | 0.440* | 0.400* |
| DEB | 0.180 | 0.116 | 0.818* | 0.400* | 0.291 | 0.373 | 0.989 | 1.000 |
| GRADE | 0.200 | 0.217 | 0.553* | 0.100* | 0.358 | 0.352 | 0.811* | 1.000 |
| DynaEval | -0.032* | -0.022* | -0.248* | 0.100* | 0.149 | 0.171 | 0.584* | 0.800* |
| USR | 0.412 | 0.423 | 0.967 | 0.900 | 0.440 | 0.418 | 0.864* | 1.000 |
| USL-H | 0.322 | 0.340 | 0.966 | 0.900 | 0.495 | 0.523 | 0.969 | 0.800* |
| DialogRPT | 0.120 | 0.105* | 0.944 | 0.600* | -0.064* | -0.083* | 0.347* | 0.800* |
| Deep AM-FM | 0.285 | 0.268 | 0.969 | 0.700* | 0.228 | 0.219 | 0.965 | 1.000 |
| HolisticEval | -0.147 | -0.123 | -0.919 | -0.200* | 0.087* | 0.113* | 0.051* | 0.000* |
| PredictiveEngage | 0.222 | 0.310 | 0.870* | 0.900 | -0.003* | 0.033* | 0.683* | 0.200* |
| FED | -0.124 | -0.135 | 0.730* | 0.100* | -0.028* | -0.000* | 0.005* | 0.400* |
| FlowScore | 0.095* | 0.082* | -0.150* | 0.400* | 0.118* | 0.079* | 0.678* | 0.800* |
| FBD | - | - | 0.916 | 0.100* | - | - | 0.644* | 0.800* |
GRADE-ConvAI2, GRADE-DailyDialog, and GRADE-EmpatheticDialogue

| Metric | ConvAI2 Turn P | ConvAI2 Turn S | ConvAI2 System P | ConvAI2 System S | DailyDialog Turn P | DailyDialog Turn S | DailyDialog System P | DailyDialog System S | EmpatheticDialogue Turn P | EmpatheticDialogue Turn S | EmpatheticDialogue System P | EmpatheticDialogue System S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLEU-4 | 0.003* | 0.128 | 0.034* | 0.000* | 0.075* | 0.184 | 1.000* | 1.000 | -0.051* | 0.002* | 1.000* | 1.000 |
| METEOR | 0.145 | 0.181 | 0.781* | 0.600* | 0.096* | 0.010* | -1.000* | -1.000 | 0.118 | 0.055* | 1.000* | 1.000 |
| ROUGE-L | 0.136 | 0.140 | 0.209* | 0.000* | 0.154 | 0.147 | 1.000* | 1.000 | 0.029* | -0.013* | 1.000* | 1.000 |
| ADEM | -0.060* | -0.057* | -0.368* | -0.200* | 0.064* | 0.071* | 1.000* | 1.000 | -0.036* | -0.028* | 1.000* | 1.000 |
| BERTScore | 0.225 | 0.224 | 0.918* | 0.800* | 0.129 | 0.100* | -1.000* | -1.000 | 0.046* | 0.033* | 1.000* | 1.000 |
| BLEURT | 0.125 | 0.120 | -0.777* | -0.400* | 0.176 | 0.133 | 1.000* | 1.000 | 0.087* | 0.051* | 1.000* | 1.000 |
| QuestEval | 0.279 | 0.319 | 0.283* | 0.400* | 0.020* | 0.006* | -1.000* | -1.000 | 0.201 | 0.272 | 1.000* | 1.000 |
| RUBER | -0.027* | -0.042* | -0.458* | -0.400* | -0.084* | -0.094* | -1.000* | -1.000 | -0.078* | -0.039* | 1.000* | 1.000 |
| BERT-RUBER | 0.309 | 0.314 | 0.885* | 1.000 | 0.134 | 0.128 | -1.000* | -1.000 | 0.163 | 0.148 | 1.000* | 1.000 |
| PONE | 0.362 | 0.373 | 0.816* | 0.800* | 0.163 | 0.163 | -1.000* | -1.000 | 0.177 | 0.161 | 1.000* | 1.000 |
| MAUDE | 0.351 | 0.304 | 0.748* | 0.800* | -0.036* | -0.073* | 1.000* | 1.000 | 0.007* | -0.057* | 1.000* | 1.000 |
| DEB | 0.426 | 0.504 | 0.995 | 1.000 | 0.337 | 0.363 | 1.000* | 1.000 | 0.356 | 0.395 | 1.000* | 1.000 |
| GRADE | 0.566 | 0.571 | 0.883* | 0.800* | 0.278 | 0.253 | -1.000* | -1.000 | 0.330 | 0.297 | 1.000* | 1.000 |
| DynaEval | 0.138 | 0.131 | -0.996 | -1.000 | 0.108* | 0.120 | -1.000* | -1.000 | 0.146 | 0.141 | -1.000* | -1.000 |
| USR | 0.501 | 0.500 | 0.995 | 1.000 | 0.057* | 0.057* | -1.000* | -1.000 | 0.264 | 0.255 | 1.000* | 1.000 |
| USL-H | 0.443 | 0.457 | 0.971 | 1.000 | 0.108* | 0.093* | -1.000* | -1.000 | 0.293 | 0.235 | 1.000* | 1.000 |
| DialogRPT | 0.137 | 0.158 | -0.311* | -0.600* | -0.000* | 0.037* | -1.000* | -1.000 | 0.211 | 0.203 | 1.000* | 1.000 |
| Deep AM-FM | 0.117 | 0.130 | 0.774* | 0.400* | 0.026* | 0.022* | 1.000* | 1.000 | 0.083* | 0.058* | 1.000* | 1.000 |
| HolisticEval | -0.030* | -0.010* | -0.297* | -0.400* | 0.025* | 0.020* | 1.000* | 1.000 | 0.199 | 0.204 | -1.000* | -1.000 |
| PredictiveEngage | 0.154 | 0.164 | 0.601* | 0.600* | -0.133 | -0.135 | -1.000* | -1.000 | -0.032* | -0.078* | 1.000* | 1.000 |
| FED | -0.090 | -0.072* | -0.254* | 0.000* | 0.080* | 0.064* | 1.000* | 1.000 | -0.014* | -0.044* | 1.000* | 1.000 |
| FlowScore | - | - | - | - | - | - | - | - | - | - | - | - |
| FBD | - | - | -0.235* | -0.400* | - | - | -1.000* | -1.000 | - | - | -1.000* | -1.000 |
DSTC6

| Metric | Turn P | Turn S | System P | System S |
| --- | --- | --- | --- | --- |
| BLEU-4 | 0.131 | 0.298 | -0.064* | 0.050* |
| METEOR | 0.307 | 0.323 | 0.633 | 0.084* |
| ROUGE-L | 0.332 | 0.326 | 0.487 | 0.215* |
| ADEM | 0.151 | 0.118 | 0.042* | 0.347* |
| BERTScore | 0.369 | 0.337 | 0.671 | 0.265* |
| BLEURT | 0.326 | 0.294 | 0.213* | 0.426* |
| QuestEval | 0.188 | 0.242 | -0.215* | 0.206* |
| RUBER | 0.114 | 0.092 | -0.074* | 0.104* |
| BERT-RUBER | 0.204 | 0.217 | 0.825 | 0.093* |
| PONE | 0.208 | 0.200 | 0.608 | 0.235* |
| MAUDE | 0.195 | 0.128 | 0.739 | 0.217* |
| DEB | 0.211 | 0.214 | -0.261* | 0.492 |
| GRADE | 0.119 | 0.122 | 0.784 | 0.611 |
| DynaEval | 0.286 | 0.246 | 0.342* | -0.050* |
| USR | 0.184 | 0.166 | 0.432* | 0.147* |
| USL-H | 0.217 | 0.179 | 0.811 | 0.298* |
| DialogRPT | 0.170 | 0.155 | 0.567 | 0.334* |
| Deep AM-FM | 0.326 | 0.295 | 0.817 | 0.674 |
| HolisticEval | 0.001* | -0.004* | 0.010 | -0.002 |
| PredictiveEngage | 0.043 | 0.004* | -0.094* | -0.409* |
| FED | -0.106 | -0.083 | 0.221* | 0.322* |
| FlowScore | 0.064 | 0.095 | 0.352* | 0.362* |
| FBD | - | - | -0.481 | -0.234* |
PredictiveEngage-DailyDialog

| Metric | Turn P | Turn S |
| --- | --- | --- |
| QuestEval | 0.296 | 0.341 |
| MAUDE | 0.104 | 0.060* |
| DEB | 0.516 | 0.580 |
| GRADE | 0.600 | 0.622 |
| DynaEval | 0.167 | 0.160 |
| USR | 0.582 | 0.640 |
| USL-H | 0.688 | 0.699 |
| DialogRPT | 0.489 | 0.533 |
| HolisticEval | 0.368 | 0.365 |
| PredictiveEngage | 0.429 | 0.414 |
| FED | 0.164 | 0.159 |
| FlowScore | - | - |
| FBD | - | - |
HolisticEval-DailyDialog

| Metric | Turn P | Turn S |
| --- | --- | --- |
| QuestEval | 0.285 | 0.260 |
| MAUDE | 0.275 | 0.364 |
| DEB | 0.584 | 0.663 |
| GRADE | 0.678 | 0.697 |
| DynaEval | -0.023* | -0.009* |
| USR | 0.589 | 0.645 |
| USL-H | 0.486 | 0.537 |
| DialogRPT | 0.283 | 0.332 |
| HolisticEval | 0.670 | 0.764 |
| PredictiveEngage | -0.033* | 0.060* |
| FED | 0.485 | 0.507 |
| FlowScore | - | - |
| FBD | - | - |
FED

| Metric | Turn P | Turn S | Dialog P | Dialog S |
| --- | --- | --- | --- | --- |
| QuestEval | 0.037* | 0.093* | -0.032* | 0.080* |
| MAUDE | 0.018* | -0.094* | -0.047* | -0.280 |
| DEB | 0.230 | 0.187 | -0.130* | 0.006* |
| GRADE | 0.134 | 0.118 | -0.034* | -0.065* |
| DynaEval | 0.319 | 0.323 | 0.503 | 0.547 |
| USR | 0.114 | 0.117 | 0.093* | 0.062* |
| USL-H | 0.201 | 0.189 | 0.073* | 0.152* |
| DialogRPT | -0.118 | -0.086* | -0.221 | -0.214 |
| HolisticEval | 0.122 | 0.125 | -0.276 | -0.304 |
| PredictiveEngage | 0.024* | 0.094* | 0.026* | 0.155* |
| FED | 0.120 | 0.095 | 0.222 | 0.320 |
| FlowScore | -0.065* | -0.055* | -0.073* | -0.003* |
| FBD | - | - | - | - |
DSTC9

| Metric | Dialog P | Dialog S | System P | System S |
| --- | --- | --- | --- | --- |
| QuestEval | 0.026* | 0.043 | 0.604 | 0.527* |
| MAUDE | 0.059 | 0.042* | 0.224* | 0.045* |
| DEB | 0.085 | 0.131 | 0.683 | 0.473* |
| GRADE | -0.078 | -0.070 | -0.674 | -0.482* |
| DynaEval | 0.093 | 0.101 | 0.652 | 0.727 |
| USR | 0.019* | 0.020* | 0.149* | 0.127* |
| USL-H | 0.105 | 0.105 | 0.566* | 0.755 |
| DialogRPT | 0.076 | 0.069 | 0.685 | 0.555* |
| HolisticEval | 0.015* | 0.002* | -0.019* | -0.100* |
| PredictiveEngage | 0.114 | 0.115 | 0.809 | 0.664 |
| FED | 0.128 | 0.120 | 0.559* | 0.391* |
| FlowScore | 0.147 | 0.140 | 0.907 | 0.900 |
| FBD | - | - | -0.669 | -0.627 |
Let the name of the new dataset be sample.
Create a directory data/sample_data and write a function load_sample_data as follows:
```python
def load_sample_data(base_dir: str):
    '''
    Args:
        base_dir: the absolute path to data/sample_data
    Return:
        Dict:
        {
            # the required items
            'contexts': List[List[str]],  # dialog contexts, each split into turns, so a single context has type List[str]
            'responses': List[str],       # dialog responses
            'references': List[str],      # dialog references; if the data has no references, still provide a dummy reference such as "NO REF"
            'scores': List[float],        # human scores
            # add any customized items
            'Customized Item': List[str]  # any additional info in the data
        }
    '''
```
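As a concrete illustration, here is a minimal sketch of such a loader. It assumes the hypothetical sample data ships as data/sample_data/sample.json, a list of records with context, response, reference, and score fields; the file name and field names are assumptions for this example only, not part of the repository.

```python
import json
import os
from typing import Dict


def load_sample_data(base_dir: str) -> Dict:
    # Assumed layout: data/sample_data/sample.json containing a list of records
    # with "context", "response", "reference", and "score" fields.
    with open(os.path.join(base_dir, 'sample.json')) as f:
        records = json.load(f)

    return {
        # Split each raw context into turns (here: one turn per line).
        'contexts': [r['context'].split('\n') for r in records],
        'responses': [r['response'] for r in records],
        # Fall back to a dummy reference when the data provides none.
        'references': [r.get('reference', 'NO REF') for r in records],
        'scores': [float(r['score']) for r in records],
    }
```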
Import the function in gen_data.py, and run python gen_data.py --source_data sample
Let the name of the new metric be metric.
Write a function gen_metric_data that transforms the data and writes it into the metric's directory:
```python
# input format 1
def gen_metric_data(data: Dict, output_path: str):
    '''
    Args:
        data: the return value of the load_data functions, e.g. {'contexts': ...}
        output_path: path to the output file
    '''

# input format 2
def gen_metric_data(data: Dict, base_dir: str, dataset: str):
    '''
    Args:
        data: the return value of the load_data functions, e.g. {'contexts': ...}
        base_dir: path to the output directory
        dataset: name of the dataset
    '''
```
We support two input formats; just follow whichever is easier for you.
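For instance, here is a minimal sketch of input format 1 that dumps the transformed data as JSON lines; the exact fields a given metric expects will differ, so this only illustrates the plumbing:

```python
import json
from typing import Dict


def gen_metric_data(data: Dict, output_path: str):
    # data is the dict returned by a load_*_data function (see above).
    with open(output_path, 'w') as f:
        for context, response, reference in zip(
                data['contexts'], data['responses'], data['references']):
            record = {
                'context': ' '.join(context),  # flatten the turns; adapt to what the metric expects
                'response': response,
                'reference': reference,
            }
            f.write(json.dumps(record) + '\n')
```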
Import the function in gen_data.py and follow the comments in the code to add the metric.
Then write a function read_metric_result to read the predictions of the metric:
```python
def read_metric_result(data_path: str):
    '''
    Args:
        data_path: path to the prediction file
    Return:
        # You can choose to return a list or a dict
        List: metric scores, e.g. [0.2, 0.3, 0.4, ...]
        or
        Dict: {'metric': List}  # metric scores
    '''
```
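A minimal sketch, assuming the metric writes one score per line to a plain text file:

```python
from typing import List


def read_metric_result(data_path: str) -> List[float]:
    # One floating-point score per line, in the same order as the input samples.
    with open(data_path) as f:
        return [float(line.strip()) for line in f]
```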
Import the function in read_result.py and follow the comments in the code to add the metric.
Then just follow the evaluation instructions above to evaluate the metric.