TL;DR: Using document as reference summary in summary evaluation
Read the Background and terminology first.
To run the experiments, first install the dependencies with `pip install -r requirements.txt`, and then run `python3 experiment.py`.
Feel free to edit the experiment configurations in `experiment.py`, which has two sections. Some metrics can be enabled or disabled by directly (un)commenting the corresponding lines in the `experiment.py` file. For other metrics, mostly variants of BERTScore-sentence, please (un)comment the lines for their hyperparameters, e.g., `weight_schemes = ["entropy", "sum"]` for the weighting schemes of BERTScore-sentence with PageRank-style sentence weighting.
The dictionary corresponding to the metrics enabled in each approach ends with the suffix `_enabled`. All enabled metrics are put together in the dictionary `all_metrics_enabled`.
The code for each approach below is in its own folder. Each folder must have a `metric.py` file that defines either `metrics`, a dictionary mapping a string (the metric name) to a callable summary metric function, or `create_metric()`, which wraps base summary metrics with additional features to create new variant metrics. Optionally, a folder may have an `eval.py` file containing the functions used to define the respective metrics.
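As an illustration, a minimal `metric.py` could look like the sketch below. The metric name, the function signature, and the scoring logic are all hypothetical and not taken from this repository.

```python
# Hypothetical metric.py sketch; the metric name, signature, and scoring logic
# are illustrative only and may differ from the metrics in this repository.
def token_overlap(document: str, summary: str) -> float:
    """Toy metric: fraction of summary tokens that also appear in the document."""
    doc_tokens = set(document.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(t in doc_tokens for t in summary_tokens) / len(summary_tokens)

# `metrics` maps a metric name (string) to a callable summary metric function.
metrics = {"token_overlap": token_overlap}
```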
Metrics: BERTScore, ROUGE, BLEURT, MoverScore
Implemented in `/classic/metric.py`.
Initial results show that BERTScore can be very effective after being repurposed as a ref-free metric.
We propose to expand BERTScore from the token level to the sentence level:
| | BERTScore | Our changes |
|---|---|---|
| Comparison between | Token pairs | Sentence pairs |
| Similarity metric | Cosine | NLI-based; semantically tells whether two sentences are related; could be trained on our own tasks |
| Weighting scheme | IDF | Semantic weighting |
The document is a list of $n$ sentences, $D=[D_1, D_2, ..., D_n]$, while the system/generated summary (to be evaluated) is a list of $m$ sentences, $S=[S_1, S_2, ..., S_m]$, where $m \ll n$.
Memory-saving pseudocode (each document's sentence embeddings are computed once and reused for all of its summaries):

    for D in all_documents:
        doc_sents = sentence_segmenter(D)            # break D into sentences [D1, D2, ...]
        doc_embs = sentence_embedder(doc_sents)      # embed each sentence in D once
        for S in summaries_of(D):                    # only the summaries of D, not all summaries
            sum_sents = sentence_segmenter(S)        # break S into sentences [S1, S2, ...]
            sum_embs = sentence_embedder(sum_sents)  # embed each sentence in S
            score = summary_scorer(doc_embs, sum_embs)
Implemented in `/bertscore_sentence/eval.py/compute_cos()`.
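For intuition, sentence-level BERTScore with cosine similarity can be sketched as below, operating on precomputed embedding matrices. This is not the repository's `compute_cos()`; the array shapes and the greedy-matching definitions of precision and recall are assumptions based on the description above.

```python
# Sketch of sentence-level BERTScore with cosine similarity; not the repo's
# compute_cos(). Shapes and greedy matching are assumptions.
import numpy as np

def bertscore_sentence_cos(doc_embs: np.ndarray, sum_embs: np.ndarray):
    """doc_embs: (n, dim) document sentences; sum_embs: (m, dim) summary sentences."""
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    s = sum_embs / np.linalg.norm(sum_embs, axis=1, keepdims=True)
    sim = s @ d.T                        # (m, n) cosine similarity matrix
    precision = sim.max(axis=1).mean()   # each summary sentence vs. its best document sentence
    recall = sim.max(axis=0).mean()      # each document sentence vs. its best summary sentence
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```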
Sentence similarity is measured via NLI probabilities. Implemented in `mnli/eval.py/compute_mnli()`. We send a pair of sentences (one from the document and the other from the system summary) to an NLI model, selected in `mnli/classifier.py`, which estimates three probabilities for the pair: entailment ($E$), contradiction ($C$), and neutral ($N$). From these three probabilities, we use different expressions (defined in `./mnli/sim_expr.py`) to define sentence similarity: $E-C$, $1-N$, and $E$ itself. Three foundation models are experimented with: `roberta-large-mnli`, `facebook/bart-large-mnli`, and `microsoft/deberta-large-mnli`.
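The sketch below shows one way to obtain such an NLI-based similarity with the Hugging Face `transformers` library and the `roberta-large-mnli` checkpoint. It is an illustration, not the repository's `compute_mnli()`, and the label order should be verified via `model.config.id2label`.

```python
# Illustrative NLI-based sentence similarity; not the repo's compute_mnli().
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def nli_similarity(doc_sent: str, sum_sent: str, expr: str = "e-c") -> float:
    """Score a (document sentence, summary sentence) pair from NLI probabilities."""
    inputs = tokenizer(doc_sent, sum_sent, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # For roberta-large-mnli the labels are contradiction, neutral, entailment;
    # check model.config.id2label before relying on this order.
    c, n, e = probs.tolist()
    if expr == "e-c":
        return e - c
    if expr == "1-n":
        return 1.0 - n
    return e  # E itself
```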
Suppose we have a list of system summary sentences, $L_S$, and a list of document sentences, $L_D$. We then find the difference in the weights generated by evaluating the attention over the sentences in the two lists, respectively, instead of over one pair of sentences at a time as in Approach 1.2.
    f1(D1, D2, ..., S1, S2, ...)
    f2( f3(D1, S1, S2, ...), f3(D2, S1, S2, ...), ..., f3(Dn, S1, S2, ...) )
    entropy( sim(S1, D1), sim(S1, D2), ... )
    entropy( sim(S2, D1), sim(S2, D2), ... )
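A minimal sketch of the entropy computation hinted at above, assuming a non-negative similarity matrix with one row per summary sentence and one column per document sentence; the actual weighting code in this repository may differ.

```python
# Per-summary-sentence entropy over similarities to document sentences.
# Assumes non-negative similarities; the repo's actual weighting may differ.
import numpy as np

def entropy_per_summary_sentence(sim: np.ndarray) -> np.ndarray:
    """sim[i, j] = similarity between summary sentence i and document sentence j."""
    p = sim / sim.sum(axis=1, keepdims=True)     # turn each row into a distribution
    return -(p * np.log(p + 1e-12)).sum(axis=1)  # entropy of each row
```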
Original BERTScore uses IDF to weight tokens. When expanding BERTScore to the sentence level, we use a PageRank-style algorithm to weight sentences.
Implemented in /pagerank
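As a rough illustration, PageRank-style sentence weighting can be sketched as follows, assuming cosine similarities between sentence embeddings and the `networkx` library; the code in `/pagerank` may differ.

```python
# Rough sketch of PageRank-style sentence weighting; may differ from /pagerank.
import numpy as np
import networkx as nx

def pagerank_sentence_weights(sent_embs: np.ndarray) -> np.ndarray:
    """Weight sentences by running PageRank on a cosine-similarity graph."""
    normed = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    sim = np.clip(normed @ normed.T, 0.0, None)  # keep edge weights non-negative
    np.fill_diagonal(sim, 0.0)                   # drop self-similarity
    graph = nx.from_numpy_array(sim)             # weighted, undirected sentence graph
    scores = nx.pagerank(graph, weight="weight")
    weights = np.array([scores[i] for i in range(sent_embs.shape[0])])
    return weights / weights.sum()               # normalize weights to sum to 1
```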
Due to the way that humans write summaries, the first few sentences in a document are more likely to be the most important ones. We use top-k (e.g., the first 3 sentences) and top-p (e.g., the first 30% of sentences) to select the first few sentences as the pseudo-reference.
Implemented in top/
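A minimal sketch of both selection rules, with hypothetical helper names (the code in top/ may differ):

```python
# Hypothetical helpers for top-k and top-p pseudo-reference selection;
# the actual code in top/ may differ.
def top_k_sentences(doc_sents, k=3):
    """Take the first k sentences of the document as the pseudo-reference."""
    return doc_sents[:k]

def top_p_sentences(doc_sents, p=0.3):
    """Take the first p fraction of sentences (at least one) as the pseudo-reference."""
    n = max(1, round(len(doc_sents) * p))
    return doc_sents[:n]
```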
Instead of the top-k and top-p selection in Approach 1.5, we use the models `google/pegasus-xsum` and `facebook/bart-large-cnn` to generate pseudo-references from documents.
Implemented in anyref/
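For example, a pseudo-reference could be generated with the Hugging Face `transformers` summarization pipeline as sketched below; this is an illustration, not the code in anyref/.

```python
# Illustrative pseudo-reference generation; not the actual code in anyref/.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def pseudo_reference(document: str) -> str:
    """Generate a pseudo-reference summary from a document."""
    # Long documents may need to be truncated to the model's maximum input length.
    return summarizer(document, max_length=128, min_length=30)[0]["summary_text"]
```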
Implemented in the `./baseline/` folder.