The team behind the HELM paper just shared a dataset of document-summary faithfulness ratings in this issue.
The ratings are binary and were crowdsourced. The rated documents come from CNN and XSum. The summaries are either reference summaries or were generated by recent models (GPT-3 etc.). I think this could be integrated into AggreFact to get an even bigger and better benchmark.
I would be interested in hearing opinions on whether this dataset is a good fit for AggreFact and what to consider when integrating it.