TL;DR: Using document as reference summary in summary evaluation
Read the Background and terminology first.
To run the experiments, first install the dependencies with `pip install -r requirements.txt`, and then run `python3 experiment.py`.
Feel free to edit the experiment configurations in `experiment.py`, which has two sections. Some metrics can be enabled or disabled by directly (un)commenting the corresponding lines in the `experiment.py` file. For other metrics, mostly variants of BERTScore-sentence, please (un)comment the lines for their hyperparameters, e.g., `weight_schemes = ["entropy", "sum"]` for the weighting schemes of BERTScore-sentence with PageRank-style sentence weighting.
The dictionary corresponding to the metrics enabled in each approach ends with the suffix `_enabled`. All enabled metrics are put together in the dictionary `all_metrics_enabled`.
The code for each approach below is in its own folder. Each folder must have a `metric.py` file that defines either `metrics`, a dictionary mapping a string (the metric name) to a callable summary metric function, or `create_metric()`, which wraps base summary metrics with additional features to create new variant metrics. Optionally, a folder may have an `eval.py` file containing the functions used to define the respective metrics.
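As an illustration, a minimal `metric.py` could look like the sketch below. The metric name, the function signature, and the scoring logic are all hypothetical and not taken from this repository.

```python
# Hypothetical metric.py sketch; the metric name, signature, and scoring logic
# are illustrative only and may differ from the metrics in this repository.
def token_overlap(document: str, summary: str) -> float:
    """Toy metric: fraction of summary tokens that also appear in the document."""
    doc_tokens = set(document.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(t in doc_tokens for t in summary_tokens) / len(summary_tokens)

# `metrics` maps a metric name (string) to a callable summary metric function.
metrics = {"token_overlap": token_overlap}
```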
Metrics: BERTScore, ROUGE, BLEURT, MoverScore
Implemented in `/classic/metric.py`.
Initial results show that BERTScore can be very effective after being repurposed as a ref-free metric.
We propose to expand BERTScore from the token level to the sentence level:
| | BERTScore | Our changes |
|---|---|---|
| Comparison between | Token pairs | Sentence pairs |
| Similarity metric | Cosine | NLI-based; semantically tells whether two sentences are related; could be trained on our own tasks |
| Weighting scheme | IDF | Semantic weighting |
The document is a list of $n$ sentences, $D=[D_1, D_2, ..., D_n]$, while the system/generated summary (to be evaluated) is a list of $m$ sentences, $S=[S_1, S_2, ..., S_m]$, where $m \ll n$.
Memory-saving pseudocode (each document's sentence embeddings are computed once and reused for all of its summaries):

    for D in all_documents:
        doc_sents = sentence_segmenter(D)            # break D into sentences [D1, D2, ...]
        doc_embs = sentence_embedder(doc_sents)      # embed each sentence in D once
        for S in summaries_of(D):                    # only the summaries of D, not all summaries
            sum_sents = sentence_segmenter(S)        # break S into sentences [S1, S2, ...]
            sum_embs = sentence_embedder(sum_sents)  # embed each sentence in S
            score = summary_scorer(doc_embs, sum_embs)
Implemented in `/bertscore_sentence/eval.py/compute_cos()`.
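For intuition, sentence-level BERTScore with cosine similarity can be sketched as below, operating on precomputed embedding matrices. This is not the repository's `compute_cos()`; the array shapes and the greedy-matching definitions of precision and recall are assumptions based on the description above.

```python
# Sketch of sentence-level BERTScore with cosine similarity; not the repo's
# compute_cos(). Shapes and greedy matching are assumptions.
import numpy as np

def bertscore_sentence_cos(doc_embs: np.ndarray, sum_embs: np.ndarray):
    """doc_embs: (n, dim) document sentences; sum_embs: (m, dim) summary sentences."""
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    s = sum_embs / np.linalg.norm(sum_embs, axis=1, keepdims=True)
    sim = s @ d.T                        # (m, n) cosine similarity matrix
    precision = sim.max(axis=1).mean()   # each summary sentence vs. its best document sentence
    recall = sim.max(axis=0).mean()      # each document sentence vs. its best summary sentence
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```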
Sentence similarity is measured via NLI probabilities. Implemented in `mnli/eval.py/compute_mnli()`. We send a pair of sentences (one from the document and the other from the system summary) to an NLI model, selected in `mnli/classifier.py`, which estimates three probabilities for the pair: entailment ($E$), contradiction ($C$), and neutral ($N$). From these three probabilities, we use different expressions (defined in `./mnli/sim_expr.py`) to define sentence similarity: $E-C$, $1-N$, and $E$ itself. Three foundation models are experimented with: `roberta-large-mnli`, `facebook/bart-large-mnli`, and `microsoft/deberta-large-mnli`.
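The sketch below shows one way to obtain such an NLI-based similarity with the Hugging Face `transformers` library and the `roberta-large-mnli` checkpoint. It is an illustration, not the repository's `compute_mnli()`, and the label order should be verified via `model.config.id2label`.

```python
# Illustrative NLI-based sentence similarity; not the repo's compute_mnli().
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def nli_similarity(doc_sent: str, sum_sent: str, expr: str = "e-c") -> float:
    """Score a (document sentence, summary sentence) pair from NLI probabilities."""
    inputs = tokenizer(doc_sent, sum_sent, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # For roberta-large-mnli the labels are contradiction, neutral, entailment;
    # check model.config.id2label before relying on this order.
    c, n, e = probs.tolist()
    if expr == "e-c":
        return e - c
    if expr == "1-n":
        return 1.0 - n
    return e  # E itself
```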
Suppose we have a list of system summary sentences, $L_S$, and a list of document sentences, $L_D$. We then find the difference in the weights generated by evaluating the attention over the sentences in the two lists, respectively, instead of over one pair of sentences at a time as in Approach 1.2.
    f1(D1, D2, ..., S1, S2, ...)
    f2( f3(D1, S1, S2, ...), f3(D2, S1, S2, ...), ..., f3(Dn, S1, S2, ...) )
    entropy( sim(S1, D1), sim(S1, D2), ... )
    entropy( sim(S2, D1), sim(S2, D2), ... )
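A minimal sketch of the entropy computation hinted at above, assuming a non-negative similarity matrix with one row per summary sentence and one column per document sentence; the actual weighting code in this repository may differ.

```python
# Per-summary-sentence entropy over similarities to document sentences.
# Assumes non-negative similarities; the repo's actual weighting may differ.
import numpy as np

def entropy_per_summary_sentence(sim: np.ndarray) -> np.ndarray:
    """sim[i, j] = similarity between summary sentence i and document sentence j."""
    p = sim / sim.sum(axis=1, keepdims=True)     # turn each row into a distribution
    return -(p * np.log(p + 1e-12)).sum(axis=1)  # entropy of each row
```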
Original BERTScore uses IDF to weight tokens. When expanding BERTScore to the sentence level, we use a PageRank-style algorithm to weight sentences.
Implemented in /pagerank
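As a rough illustration, PageRank-style sentence weighting can be sketched as follows, assuming cosine similarities between sentence embeddings and the `networkx` library; the code in `/pagerank` may differ.

```python
# Rough sketch of PageRank-style sentence weighting; may differ from /pagerank.
import numpy as np
import networkx as nx

def pagerank_sentence_weights(sent_embs: np.ndarray) -> np.ndarray:
    """Weight sentences by running PageRank on a cosine-similarity graph."""
    normed = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    sim = np.clip(normed @ normed.T, 0.0, None)  # keep edge weights non-negative
    np.fill_diagonal(sim, 0.0)                   # drop self-similarity
    graph = nx.from_numpy_array(sim)             # weighted, undirected sentence graph
    scores = nx.pagerank(graph, weight="weight")
    weights = np.array([scores[i] for i in range(sent_embs.shape[0])])
    return weights / weights.sum()               # normalize weights to sum to 1
```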
Due to the way that humans write summaries, the first few sentences in a document are more likely to be the most important ones. We use top-k (e.g., the first 3 sentences) and top-p (e.g., the first 30% of sentences) to select the first few sentences as the pseudo-reference.
Implemented in top/
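A minimal sketch of both selection rules, with hypothetical helper names (the code in top/ may differ):

```python
# Hypothetical helpers for top-k and top-p pseudo-reference selection;
# the actual code in top/ may differ.
def top_k_sentences(doc_sents, k=3):
    """Take the first k sentences of the document as the pseudo-reference."""
    return doc_sents[:k]

def top_p_sentences(doc_sents, p=0.3):
    """Take the first p fraction of sentences (at least one) as the pseudo-reference."""
    n = max(1, round(len(doc_sents) * p))
    return doc_sents[:n]
```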
Instead of the top-k and top-p selection in Approach 1.5, we use the models `google/pegasus-xsum` and `facebook/bart-large-cnn` to generate pseudo-references from documents.
Implemented in anyref/
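For example, a pseudo-reference could be generated with the Hugging Face `transformers` summarization pipeline as sketched below; this is an illustration, not the code in anyref/.

```python
# Illustrative pseudo-reference generation; not the actual code in anyref/.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def pseudo_reference(document: str) -> str:
    """Generate a pseudo-reference summary from a document."""
    # Long documents may need to be truncated to the model's maximum input length.
    return summarizer(document, max_length=128, min_length=30)[0]["summary_text"]
```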
Implemented in the `./baseline/` folder.