
summaries: A Toolkit for the Summarization Ecosystem

Author: Dennis Aumiller
Heidelberg University

Reproducibility of German Summarization Dataset Experiments

Part of this library has been officially accepted as a long paper at BTW'23! If you are interested in reproducing the contents of this work, see the file REPRODUCIBILITY.md.

Installation

During development you can install this framework by following the steps below:

  1. Clone this GitHub repository: git clone https://github.com/dennlinger/summaries.git
  2. Navigate to the repository folder and install all necessary dependencies: python3 -m pip install -r requirements.txt
  3. Set up the library with python3 -m pip install . (if you want an automatically updated development version, add -e to the command).

You can now import the library with import summaries.

Usage

For some of the functionalities, there are scripts in examples/ illustrating basic usage, as well as scripts in experiments/ documenting concrete experiments on different (predominantly German) summarization datasets.

Pre-Processing Data

Sensible exploratory data analysis and thorough data pre-processing are often overlooked when working in an ML context. The summaries package provides a number of functionalities around this aspect, with a particular focus on summarization-specific filters and analysis functions.

summaries.Analyzer

The main purpose of the Analyzer class is to serve as a collection of tools for inspecting datasets, both at the level of individual samples and across entire training/validation/test splits. Currently, the Analyzer offers the following functionalities:

Code example of detecting a faulty summarization sample:

from summaries import Analyzer

analyzer = Analyzer(lemmatize=True, lang="en")

# An invalid summarization sample
reference = "A short text."
summary = "A slightly longer text."

print(analyzer.is_summary_longer_than_reference(summary, reference, length_metric="char"))
# True

summaries.analysis.Stats

An additional module similar to Analyzer, but more focused on dataset-wide computation of length statistics.

Offers the following functions:
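
As a rough illustration of the kind of dataset-wide length statistics this module targets, here is a standalone sketch (this is not the Stats API itself; the field names and the compression-ratio computation are assumptions):

from statistics import mean

# Standalone sketch, not the library's Stats module: dataset-wide length
# statistics for dict-like samples with "text" and "summary" fields.
def length_stats(samples, text_key="text", summary_key="summary"):
    reference_lengths = [len(sample[text_key].split()) for sample in samples]
    summary_lengths = [len(sample[summary_key].split()) for sample in samples]
    ratios = [r / s for r, s in zip(reference_lengths, summary_lengths) if s > 0]
    return {
        "mean_reference_length": mean(reference_lengths),
        "mean_summary_length": mean(summary_lengths),
        "mean_compression_ratio": mean(ratios),
    }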

summaries.Cleaner

By itself, the Analyzer can already be used to streamline exploratory data analysis; more frequently, however, problematic samples should be removed from the dataset directly. For this purpose, the library provides summaries.Cleaner, which internally uses a number of Analyzer functionalities to remove samples. Its main entry point, Cleaner.clean_dataset(), takes different splits of a dataset (splits are entirely optional) and removes samples based on the configured criteria. As input, Cleaner accepts either a list of dict-like data instances or splits derived from a Huggingface datasets.Dataset. Additionally, the function prints a distribution of filtered samples by reason for filtering.

Currently, the following filters are applied:

Duplications are expressed as four different types, illustrated by the sketch after this list:

  1. exact_duplicate, where the exact combination of (reference, summary) has been encountered before.
  2. both_duplicate, where both the reference and summary have been encountered before, but in separate instances.
  3. reference_duplicate, where only the reference has been encountered before.
  4. summary_duplicate, where only the summary has been encountered before.
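
To make the distinction between these categories concrete, here is a minimal standalone sketch (not the Cleaner's internal code) that classifies a new (reference, summary) pair against previously seen samples:

# Standalone sketch, not the Cleaner's internal implementation.
def duplicate_type(reference, summary, seen_pairs, seen_references, seen_summaries):
    if (reference, summary) in seen_pairs:
        return "exact_duplicate"
    reference_seen = reference in seen_references
    summary_seen = summary in seen_summaries
    if reference_seen and summary_seen:
        return "both_duplicate"
    if reference_seen:
        return "reference_duplicate"
    if summary_seen:
        return "summary_duplicate"
    return None  # not a duplicate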

Code example of filtering a Huggingface dataset:

from datasets import load_dataset
from summaries import Analyzer, Cleaner

analyzer = Analyzer(lemmatize=True, lang="de")
cleaner = Cleaner(analyzer, min_length_summary=20, length_metric="char", extractiveness="fully")

# The German subset of MLSUM has plenty of extractive samples that need to be filtered
data = load_dataset("mlsum", "de")

clean_data = cleaner.clean_dataset("summary", "text", data["train"], data["validation"], data["test"])

AspectSummarizer

The main functionality is a summarizer built around a two-stage framework: a topical extraction component (currently keyphrase extraction) first identifies topics, which are then used as queries by a second-stage retriever.

Currently, there are the following options for the respective Extractor and Retriever components:

By default, the AspectSummarizer will retrieve k sentences for each of N topics. For single-document summarization use cases, the resulting list of sentences is ordered according to the original sentence order, and duplicate sentences are removed (these can occur when a sentence is relevant for several different topics).
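
As a conceptual sketch of this two-stage pipeline (not the library's implementation; extract_keyphrases and retrieve are stand-ins for the configurable Extractor and Retriever components):

# Conceptual sketch only; the actual components are configurable classes.
def aspect_summarize(sentences, extract_keyphrases, retrieve, n_topics=5, k=2):
    # Stage 1: extract N topical keyphrases from the full document.
    topics = extract_keyphrases(" ".join(sentences), n_topics)
    # Stage 2: retrieve k sentences per topic.
    selected = set()
    for topic in topics:
        selected.update(retrieve(topic, sentences, k))
    # Restore the original sentence order and drop duplicates across topics.
    return [sentence for sentence in sentences if sentence in selected]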

Alignment Strategies

To create suitable training data (at the sentence level), it may be necessary to align source and summary texts. This toolkit provides several approaches to extract such alignments.

RougeNAligner

This method follows prior work (TODO: Insert citation) in creating alignments based on ROUGE-2 maximization, with slight differences. Whereas prior work uses a greedy algorithm that adds sentences until the metric saturates, we proceed by adding a 1:1 alignment for each sentence in the summary. This has the advantage of covering a wider range of the source text (for some summary sentences, alignments might appear relatively late in the text), at the cost of potentially getting stuck in a local optimum. Furthermore, 1:1 alignments are not the full story, since sentence splitting/merging are also frequent operations, which this alignment strategy does not cover.

Usage:

from summaries.aligners import RougeNAligner

# Use ROUGE-2 optimization, with F1 scores as the maximizing attribute
aligner = RougeNAligner(n=2, optimization_attribute="fmeasure")
# Inputs can either be a raw document (string), or pre-split (sentencized) inputs (list of strings). 
relevant_source_sentences = aligner.extract_source_sentences(summary_text, source_text)

SentenceTransformerAligner

This method works similarly to the RougeNAligner, but instead uses a sentence-transformers model to compute the similarity between source and summary sentences (by default, paraphrase-multilingual-MiniLM-L12-v2).
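
Usage mirrors the RougeNAligner example above; note that the constructor argument shown here is an assumption based on the default model mentioned in the text:

from summaries.aligners import SentenceTransformerAligner

# The model_name argument is an assumption; check the class signature for the exact parameter.
aligner = SentenceTransformerAligner(model_name="paraphrase-multilingual-MiniLM-L12-v2")
relevant_source_sentences = aligner.extract_source_sentences(summary_text, source_text)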

Evaluation

Baseline Methods

The library provides unsupervised baselines for comparison. In particular, we implement lead_3, lead_k, and a modified LexRank baseline.

lead_3 and lead_k simply copy and return the first few sentences of the input document as a summary. lead_3 was popularized mainly by (Nallapati et al., 2016). Our own work introduces a variant that accounts for slightly longer contexts, which is especially useful for long-form summaries (e.g., Wikipedia or legal documents), where 3 sentences vastly underestimate the expected output length.
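
Conceptually, the lead baselines reduce to returning the first k segments of the sentencized input; a minimal standalone sketch (not the library's implementation):

# Standalone sketch: lead_k on an already-sentencized document.
def lead_k_concept(sentences, k=3):
    return " ".join(sentences[:k])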

For the lexrank_st baseline, we adapt the modification suggested by Nils Reimers, which replaces the centrality computation with cosine similarity over the segment embeddings generated by sentence-transformers models.

By default, all the baselines will utilize a language-specific tokenizer based on spaCy to segment the text into individual sentences. If you have extremely long inputs, I would recommend doing a paragraph-level split first yourself, and then passing the segmented inputs directly. The baselines can handle inputs of both formats natively.

Usage:

from summaries.baselines import lead_3, lexrank_st
import spacy

# specify the length of the lexrank summary in segments:
num_segments = 5

lead_3(input_text, lang="en")
lexrank_st(input_text, lang="en", num_sentences=num_segments)

# or, alternatively:
nlp = spacy.load("en_core_web_sm")
lead_3(input_text, processor=nlp)
lexrank_st(input_text, processor=nlp, num_sentences=num_segments)

# or, split the text yourself first (e.g., at the paragraph level) and pass the segments directly:
segments = input_text.split("\n\n")
lexrank_st(segments, num_sentences=num_segments)

Significance Testing

For the sake of reproducible research, we also provide a simple implementation of paired bootstrap resampling, following (Koehn, 2004). It allows the comparison of two systems, A and B, on a gold test set. The hypothesis is that system A outperforms B. The returned score is the p-value.

Usage:

from summaries.evaluation import paired_bootstrap_test

# Replace with any metric of your choice, but make sure it takes
# lists of system and gold inputs and returns a single float score.
def accuracy(system, gold):
    return sum([s == g for s, g in zip(system, gold)]) / len(system)

# By default, 10,000 resampling iterations are performed; here we use 1,000
paired_bootstrap_test(gold_labels,
                      system_a_predictions,
                      system_b_predictions,
                      scoring_function=accuracy,
                      n_resamples=1000,
                      seed=12345)

Extending or Supplying Own Components

Citation

If you found this library useful, please consider citing the following work:

@inproceedings{aumiller-etal-2023-on,
  author    = {Dennis Aumiller and
               Jing Fan and
               Michael Gertz},
  title     = {{On the State of German (Abstractive) Text Summarization}},
  booktitle = {Datenbanksysteme f{\"{u}}r Business, Technologie und Web {(BTW}
               2023)},
  series    = {{LNI}},
  publisher = {Gesellschaft f{\"{u}}r Informatik, Bonn},
  year      = {2023}
}