AIPHES / DiscoScore

DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence

Usage example? #1

Closed JohnGiorgi closed 1 year ago

JohnGiorgi commented 2 years ago

Hi!

Is there an example of how to use this metric for evaluation? I would be interested in instantiating some objects and computing the DiscoScore between a reference and a generated summary. Ideally the usage would be something like BERTScore or BARTScore.

andyweizhao commented 2 years ago

@JohnGiorgi thanks for your interest! I will add a running example in 1-2 days :))

JohnGiorgi commented 2 years ago

Awesome! Thanks a lot

andyweizhao commented 2 years ago

@JohnGiorgi,

I have added an example in the style of BARTScore. The code supports 6 discourse metrics, including DiscoScore. The details of these metrics are provided in Appendix A.1 of the paper.

Note that if system and reference texts do not contain coherence phenomena (e.g., no word repetition), then the discourse metrics would return 0.
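To make the note above concrete, here is a toy check for the simplest coherence phenomenon mentioned, word repetition across sentences. This is only an illustrative sketch and not part of the DiscoScore code: the function name `repeated_words`, the `min_len` cutoff, and the regex-based sentence splitting are all assumptions made for the example.

```python
import re
from collections import Counter


def repeated_words(text, min_len=4):
    """Return words (length >= min_len) that occur in more than one sentence.

    Hypothetical helper for illustration only; DiscoScore itself may detect
    coherence phenomena differently.
    """
    # Naive sentence split on ., ! and ? (an assumption, not a real tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    counts = Counter()
    for sent in sentences:
        # Count each word once per sentence, so a count above 1 means
        # the word is reused across sentences.
        counts.update({w for w in re.findall(r"[a-z]+", sent) if len(w) >= min_len})
    return sorted(w for w, c in counts.items() if c > 1)


if __name__ == "__main__":
    print(repeated_words("Townsend came on late. Townsend scored."))
    print(repeated_words("The cat sat. A dog ran."))
```

If a check like this finds nothing in either the system or the reference text, a score of 0 from the discourse metrics is what the note above describes, not necessarily a bug.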

JohnGiorgi commented 2 years ago

Awesome! Thank you @andyweizhao. A couple of questions if that's okay!

andyweizhao commented 2 years ago

@JohnGiorgi

  1. The recommended models are Conpono (the discourse variant of BERT) for DS-Focus and BERT-NLI for DS_SENT. I have slightly adjusted the text in README.md.
  2. DiscoScore uses F1, and thus takes both directions into account. But the input arguments of m(x, y) cannot be swapped, because the current code assumes that the 2nd argument is a set of multiple references.
  3. If I remember correctly, the results of DiscoScore in reference-free settings are not strong.

andyweizhao commented 2 years ago

About F1: DiscoScore uses Precision by default, but one can enable Recall and F1 by slightly adjusting the code.
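For anyone unsure how Precision, Recall and F1 differ with a multi-reference second argument, the following toy sketch uses plain token overlap. It is not DiscoScore's implementation (which works on BERT-based representations); the name `overlap_prf` and the choice to keep the best-matching reference are assumptions made purely for illustration.

```python
from collections import Counter


def overlap_prf(system, references):
    """Token-overlap (precision, recall, F1) of `system` against the
    best-matching reference in `references` (a list of strings).

    Hypothetical helper for illustration only, not the DiscoScore API.
    """
    sys_tokens = Counter(system.lower().split())
    best = (0.0, 0.0, 0.0)
    for ref in references:
        ref_tokens = Counter(ref.lower().split())
        # Multiset intersection: tokens shared by system and reference.
        overlap = sum((sys_tokens & ref_tokens).values())
        p = overlap / max(sum(sys_tokens.values()), 1)  # share of system tokens matched
        r = overlap / max(sum(ref_tokens.values()), 1)  # share of reference tokens matched
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best[2]:
            best = (p, r, f1)
    return best


if __name__ == "__main__":
    p, r, f1 = overlap_prf("a b c d", ["a b"])
    print(p, r, round(f1, 4))
```

Precision and Recall trade places when the two texts are swapped while F1 stays symmetric, but the signature itself is not symmetric: the first argument is a single string and the second is a list of references, which is the same reason the arguments of m(x, y) cannot simply be exchanged.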