fladhak / creative-summ-data

Datasets for the Creative-Summ 2022 Workshop

This repository contains the datasets for the shared task of the Automatic Summarization for Creative Writing (Creative-Summ) workshop at COLING 2022.

More information can be found at https://creativesumm.github.io/sharedtask.

Shared Task Overview

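The CreativeSumm 2022 shared task is divided into four sub-tasks, one for each corpus described below: novel chapters, movie scripts, primetime television transcripts, and soap opera transcripts.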

The training data for each sub-task comes from existing, well-established datasets (see below), but for the movie and television sub-tasks we will provide new, unseen test inputs for evaluation.

Corpora

Novel Chapters/BookSum (Ladhak et al., 2020; Kryściński et al., 2021)

This dataset pairs chapters of novels released as part of Project Gutenberg with corresponding summaries. For this shared task, we provide the novel chapters here. Unfortunately, we cannot provide the summaries, because the study guide websites that host them are copyrighted. However, each novel chapter in the provided data includes a link to the page where the summary text can be found.

Please see the associated papers, Ladhak et al. (2020) and Kryściński et al. (2021), for more information on how the summaries were collected.

Notes:

Scriptbase (Gorinski & Lapata, 2015)

This dataset pairs movie transcripts with their corresponding Wikipedia summaries. The data may be downloaded from here. See the main repository for additional information. We've split the dataset into train and validation splits, and the list of movies associated with each split can be found here.

NOTE: The input for this task is the movie script (the script.txt file) and the target summary is the plain text synopsis from Wikipedia (the processed/wikiplot.txt file).
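As a rough illustration of the note above, the following sketch reads one input/target pair per movie. Only the file names `script.txt` and `processed/wikiplot.txt` come from this README; the layout of one directory per movie under a data root, the split file with one directory name per line, and the helper names `load_movie_pair` and `iter_split` are assumptions made for the example, so adjust them to match the downloaded data.

```python
from pathlib import Path
from typing import Iterator, Tuple


def load_movie_pair(movie_dir: Path) -> Tuple[str, str]:
    """Return (script, summary) for a single movie directory."""
    # File names are taken from the README note; the encoding is an assumption.
    script_path = movie_dir / "script.txt"
    summary_path = movie_dir / "processed" / "wikiplot.txt"
    script = script_path.read_text(encoding="utf-8", errors="replace")
    summary = summary_path.read_text(encoding="utf-8", errors="replace")
    return script, summary


def iter_split(data_root: Path, split_file: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (movie_name, script, summary) for each movie listed in split_file.

    Assumes (hypothetically) that split_file holds one movie directory
    name per line and that each directory sits directly under data_root.
    """
    for line in split_file.read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if not name:
            continue  # ignore blank lines in the split list
        movie_dir = data_root / name
        has_script = (movie_dir / "script.txt").is_file()
        has_summary = (movie_dir / "processed" / "wikiplot.txt").is_file()
        if not (has_script and has_summary):
            continue  # skip entries whose files are missing
        script, summary = load_movie_pair(movie_dir)
        yield name, script, summary
```

Skipping entries with missing files keeps the loop usable on a partial download; raise an error there instead if every listed movie is expected to be present.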

SummScreen, Forever Dreaming (Chen et al., 2022)

This dataset pairs TV transcripts from primetime shows with their corresponding Wikipedia summaries. We will use the version of this data associated with the SCROLLS Benchmark (Shaham et al., 2022), and you may download the data there. Please see the notes below for important additional information!

Notes:

SummScreen, TV Megasite (Chen et al., 2022)

This dataset pairs soap opera transcripts with summaries written by TV Megasite contributors. We have preprocessed the data so that it is in the same format as the Forever Dreaming data (i.e., it follows SCROLLS conventions), and it may be downloaded here.
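Because both SummScreen sub-tasks are described as following SCROLLS conventions, a single reader may be enough for both. The sketch below assumes the files are JSON Lines with one record per line and that the transcript and summary live under the keys `input` and `output`; those key names and the helper names `read_jsonl` and `iter_pairs` are assumptions for illustration and should be checked against the downloaded files.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Optional, Tuple


def read_jsonl(path: Path) -> Iterator[Dict[str, object]]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_pairs(
    path: Path,
    input_key: str = "input",    # assumed key for the transcript
    output_key: str = "output",  # assumed key for the reference summary
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (transcript, summary) pairs; summary is None when absent."""
    for record in read_jsonl(path):
        # Test files may omit the reference summary, hence .get().
        yield record[input_key], record.get(output_key)
```

Returning `None` for a missing summary lets the same function run over test inputs that carry no reference.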

Notes:

References

Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel. 2022. SummScreen: A Dataset for Abstractive Screenplay Summarization. In ACL.

Philip John Gorinski, Mirella Lapata. 2015. Movie Script Summarization as Graph-based Scene Extraction. In NAACL.

Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev. 2021. BookSum: A Collection of Datasets for Long-form Narrative Summarization. arXiv preprint.

Faisal Ladhak, Bryan Li, Yaser Al-Onaizan, Kathleen McKeown. 2020. Exploring Content Selection in Summarization of Novel Chapters. In ACL.