About the generation metric for questions with multiple correct answers

caiqizh commented 4 weeks ago

I am wondering whether the current generation metrics can handle questions with multiple correct answers (or many alternative answers).

Thank you for the great codebase!

rvashurin commented 2 weeks ago

Hi, @caiqizh!

Indeed, this can be done with the current codebase; see the TriviaQA example. If the multiref option is set to true in the YAML config, as here: https://github.com/IINemo/lm-polygraph/blob/main/examples/configs/polygraph_eval_triviaqa.yaml#L24, all generation metrics are wrapped in the AggregatedMetric class (https://github.com/IINemo/lm-polygraph/blob/main/src/lm_polygraph/generation_metrics/aggregated_metric.py). The wrapper compares the model's output against every reference and aggregates the resulting metric values (currently only max is supported). For Accuracy, for example, this means the metric equals 1 for a given input if the model's output matches any of the alternative references.
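To illustrate the idea of max aggregation over references, here is a minimal standalone sketch. It is not the AggregatedMetric implementation and does not use the lm-polygraph API; `exact_match` and `aggregate_max` are hypothetical names made up for this example, and the toy accuracy is just a case-insensitive string comparison.

```python
from typing import Callable, Sequence


def exact_match(output: str, reference: str) -> float:
    """Toy accuracy (assumption): 1.0 if the normalized strings are equal, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())


def aggregate_max(
    metric: Callable[[str, str], float],
    output: str,
    references: Sequence[str],
) -> float:
    """Score the output against every reference and keep the best value."""
    if not references:
        raise ValueError("at least one reference is required")
    return max(metric(output, ref) for ref in references)


# One question with two acceptable answers.
references = ["paris", "City of Paris"]
hit = aggregate_max(exact_match, "Paris", references)   # matches the first reference
miss = aggregate_max(exact_match, "Lyon", references)   # matches none of them
```

With max as the aggregation, a single matching reference is enough for the input to count as correct, which is the behavior described above for Accuracy.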

We really need to work on updating and expanding the documentation... @IINemo

rvashurin commented 2 weeks ago

I will keep this issue open until we ship docs with this functionality clearly explained.

IINemo / lm-polygraph

About the generation metric for questions with multiple correct answers #207