
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Authors: Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown

Update: Our work has been accepted to NAACL 2024 🎉! Please check out the paper here 📃

This repository contains the annotations for the released benchmark dataset TofuEval. Note that TofuEval is an evaluation benchmark; its data should not be used to train NLP models.

On 05.03.2024, a 64-character identifier string was added to each instance in TofuEval to assist in the future detection of contamination in web-crawled corpora.
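As an illustration, a contamination check can amount to scanning a crawled corpus for these strings, since a random 64-character identifier is vanishingly unlikely to occur by chance. The sketch below assumes a hypothetical canary_id column; substitute the actual identifier field in the released files.

import pandas as pd

annotations = pd.read_csv("factual_consistency/mediasum_factual_eval_test.csv")
canary_ids = set(annotations["canary_id"].astype(str))  # hypothetical column name

def is_contaminated(text):
    # Any verbatim match of an identifier indicates the benchmark leaked
    # into the corpus the text was drawn from.
    return any(cid in text for cid in canary_ids)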

Documents in TofuEval

We provide the dev/test splits of TofuEval, along with document identifiers, in document_ids_dev_test_split.json, which can be used to obtain the source documents from MediaSum and MeetingBank. You can extract and preprocess the source documents on your own via the links to the original data repositories.

Or you can use the following code snippet to extract the documents used in TofuEval:

from datasets import load_dataset
import json
import pandas as pd

def obtain_dialogue_mediasum(dialogue_selected):
    """Flatten each MediaSum dialogue into a single 'speaker: utterance' transcript."""
    dialogue_df = pd.DataFrame(columns=['doc_id', 'source'])
    for dialogue in dialogue_selected:
        dialogue_id = dialogue['id']
        speakers = dialogue['speaker']
        utts = dialogue['utt']
        transcript = ''
        for speaker, utt in zip(speakers, utts):
            transcript += f"{speaker}: {utt}\n"
        transcript = transcript.strip()
        dialogue_df.loc[len(dialogue_df)] = [dialogue_id, transcript]
    return dialogue_df

with open("document_ids_dev_test_split.json") as file:
    document_mapping = json.load(file)

meetingbank_dev_ids = document_mapping['dev']['meetingbank']
meetingbank_test_ids = document_mapping['test']['meetingbank']
mediasum_dev_ids = document_mapping['dev']['mediasum']
mediasum_test_ids = document_mapping['test']['mediasum']

meetingbank = pd.DataFrame(load_dataset("lytang/MeetingBank-transcript")['test'])
meetingbank_dev = meetingbank[meetingbank.meeting_id.isin(meetingbank_dev_ids)][['meeting_id', 'source']].reset_index(drop=True).to_csv("meetingbank_dev_doc.csv", index=False)
meetingbank_test = meetingbank[meetingbank.meeting_id.isin(meetingbank_test_ids)][['meeting_id', 'source']].reset_index(drop=True).to_csv("meetingbank_test_doc.csv", index=False)

with open("/path/to/news_dialogue.json") as file:
    news_dialogue = json.load(file)
dialogue_dev = [dialogue for dialogue in news_dialogue if dialogue['id'] in mediasum_dev_ids]
dialogue_test = [dialogue for dialogue in news_dialogue if dialogue['id'] in mediasum_test_ids]

mediasum_dev = obtain_dialogue_mediasum(dialogue_dev).to_csv("mediasum_dev_doc.csv", index=False)
mediasum_test = obtain_dialogue_mediasum(dialogue_test).to_csv("mediasum_test_doc.csv", index=False)
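Running the snippet produces four CSV files (meetingbank_dev_doc.csv, meetingbank_test_doc.csv, mediasum_dev_doc.csv, mediasum_test_doc.csv), each pairing a document identifier (meeting_id or doc_id) with the full transcript in the source column.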

Factual Consistency Annotation

factual_consistency/{dataset}_factual_eval_{split}.csv contains factual consistency evaluations by expert linguistic annotators. The columns are described below, followed by a short loading sketch.

| Col. name | Description |
| --- | --- |
| doc_id | The document id of a source document. |
| annotation_id | The index of the source document in TofuEval. |
| topic | The topic used to generate topic-focused summaries. |
| model_name | The model used to generate the summary. Models are anonymized, and the order of models is shuffled across topics. |
| sent_idx | The sentence index within the model-generated summary. |
| summ_sent | The {sent_idx}-th summary sentence generated by {model_name}. A full summary can be reconstructed by joining these sentences in order of {sent_idx}. |
| sent_label | yes if the summary sentence is factually consistent; no otherwise. |
| exp | A human-written explanation of why {summ_sent} is factually inconsistent. |
| type | Human-annotated error type(s) for {summ_sent}. A sentence can have multiple error types. |
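For illustration, the following sketch (using the mediasum test file; any {dataset}/{split} pair works the same way) reconstructs full summaries from the per-sentence rows and computes the fraction of factually consistent sentences per model:

import pandas as pd

fc = pd.read_csv("factual_consistency/mediasum_factual_eval_test.csv")

# Rebuild each full summary by joining its sentences in order of sent_idx.
summaries = (
    fc.sort_values("sent_idx")
      .groupby(["doc_id", "topic", "model_name"])["summ_sent"]
      .apply(" ".join)
      .reset_index(name="summary")
)

# Sentence-level factual consistency rate per (anonymized) model.
consistency = fc["sent_label"].eq("yes").groupby(fc["model_name"]).mean()
print(consistency)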

Update: Extra Annotations

We have extended TofuEval with factual consistency annotations for one more model (Model-Extra). The latest version of TofuEval contains annotations for 6 models (1.8K summaries and 5K summary sentences)!

Completeness Annotation

completeness/{dataset}_completeness_final.csv contains human-written key points for each topic. The columns are described below, followed by a short loading sketch.

| Col. name | Description |
| --- | --- |
| doc_id | The document id of a source document. |
| annotation_id | The index of the source document in TofuEval. |
| topic | The topic used to generate topic-focused summaries. |
| key_points | Human-written key points for the topic. |
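A minimal loading sketch under the same file-naming pattern (here the mediasum file); key_points is treated as plain text, with no assumption about its internal formatting:

import pandas as pd

kp = pd.read_csv("completeness/mediasum_completeness_final.csv")

# One row per (document, topic); show the key points for the first few rows.
for _, row in kp.head(3).iterrows():
    print(row["doc_id"], "|", row["topic"])
    print(row["key_points"])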

Topic Categorization

topic_category/{dataset}_topic_category.json categorizes each topic as either main or marginal.
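The exact JSON layout is not documented above; assuming a flat topic-to-category mapping, a quick look at the category balance could be:

import json
from collections import Counter

with open("topic_category/mediasum_topic_category.json") as file:
    topic_category = json.load(file)

# Assumed layout: {topic: "main" or "marginal"}.
print(Counter(topic_category.values()))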

Citation

If you find the benchmark useful, please consider citing our work:

@inproceedings{Tang2024,
  author = {Liyan Tang and Igor Shalyminov and Amy Wong and Jon Burnsky and Jake Vincent and Yu'an Yang and Siffi Singh and Song Feng and Hwanjun Song and Hang Su and Justin Sun and Yi Zhang and Saab Mansour and Kathleen McKeown},
  title = {TofuEval: Evaluating hallucinations of LLMs on topic-focused dialogue summarization},
  year = {2024},
  url = {https://www.amazon.science/publications/tofueval-evaluating-hallucinations-of-llms-on-topic-focused-dialogue-summarization},
  booktitle = {NAACL 2024},
}