evaluation should support calculated summaries

Overview

In #91, we changed how summary labels for claims were generated -- instead of the LLM determining them, they are now calculated with a scoring system.

At the time, we did not update the promptfoo evaluation to support this change, so it still expects the summary to already be in the labels dict.

We should add the summary as soon as the LLM inference is complete.

Requirements

In promptfoo evaluation, the function that generates a claim summary should be run on claims as soon as they have been through the LLM.
The labelled data should also have the summary label added when it is loaded.
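
A rough sketch of how both requirements could share one helper is below. Every name here is hypothetical: `placeholder_summary` is only a stand-in for the scoring function from #91 (its real name and rules are not given in this issue), and the `harm` label, the `summary` key and the shape of a claim are assumptions for illustration. It also assumes that any summary already present in the labelled data should be recalculated rather than kept.

```python
from typing import Callable

Labels = dict[str, str]


def placeholder_summary(labels: Labels) -> str:
    """Hypothetical stand-in for the scoring function introduced in #91."""
    # Assumption: a toy rule only; the real scoring system is not described here.
    return "worth checking" if labels.get("harm") == "high" else "not worth checking"


def with_summary(
    claim: dict, summarise: Callable[[Labels], str] = placeholder_summary
) -> dict:
    """Return a copy of the claim whose labels dict includes a calculated summary."""
    labels = dict(claim.get("labels", {}))  # copy, so the input is not mutated
    labels["summary"] = summarise(labels)  # assumption: overwrite any stale value
    return {**claim, "labels": labels}


def add_summaries(
    claims: list[dict], summarise: Callable[[Labels], str] = placeholder_summary
) -> list[dict]:
    """Apply with_summary to every claim in a list."""
    return [with_summary(claim, summarise) for claim in claims]


# 1. Straight after LLM inference: the labels have no summary yet.
llm_output = [{"claim": "X cures Y", "labels": {"harm": "high"}}]
predicted = add_summaries(llm_output)

# 2. When loading labelled data: run the same helper so both sides match.
labelled_data = [{"claim": "X cures Y", "labels": {"harm": "high", "summary": "stale"}}]
expected = add_summaries(labelled_data)
```

Using the same helper on both sides would mean the evaluation compares summaries produced by the same calculation, rather than a calculated summary against whatever was stored with the labelled data.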

FullFact / health-misinfo-shared

evaluation should support calculated summaries #154