Open notrichardren opened 2 months ago
Based on my understanding, lm-eval-harness is not able to do the cross-group analysis required for Anthropic's discrim-eval: each metric is applied to each individual prompt, and the results are then aggregated in a way that doesn't account for which demographic group a prompt belongs to.
For now, I'm “rescuing” all the results and saving them so I can process them outside lm-eval-harness (shown in code above).
A bit hacky, but maybe you could pass your results from metric computation as a tuple `(logit_diff, doc grouping id, {any other info?})` and have a custom aggregation function aggregate within each group and report the final aggregated score?
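A minimal sketch of what that tuple-plus-custom-aggregation idea could look like (the function name and tuple shape here are hypothetical, not part of lm-eval-harness's actual API):

```python
from collections import defaultdict

def agg_logit_diff_by_group(items):
    """Hypothetical custom aggregation: `items` are (logit_diff, group_id)
    tuples emitted once per document. Average the logit diffs within each
    demographic group, then summarize the spread across groups."""
    by_group = defaultdict(list)
    for logit_diff, group_id in items:
        by_group[group_id].append(logit_diff)
    group_means = {g: sum(v) / len(v) for g, v in by_group.items()}
    # One summary number: the largest gap between any two group means.
    return max(group_means.values()) - min(group_means.values())

# Two groups whose mean logit diffs differ by 0.1
items = [(0.2, "group_a"), (0.4, "group_a"), (0.1, "group_b"), (0.3, "group_b")]
print(agg_logit_diff_by_group(items))
```

Collapsing to a single scalar like this is only one option; the aggregation could just as well keep the per-group means around for reporting.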
I'd like to make this possible to implement; I'll try to take a closer look ASAP.
Makes sense. I may also want to report separate results for various groups (e.g. race, gender, etc.), whereas my impression was that an aggregation returns a single number.
Implementing Anthropic's discrimination evaluation requires comparing logit differences across groups such as age, gender, and race. This seems difficult to implement with "metrics" and "aggregation" alone, as there doesn't seem to be a way to pass the age/gender/race information through to the aggregation step.
Is there something I'm missing about lm-eval-harness's features that would allow for an easier implementation?
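For context, here is a rough sketch of the kind of post-processing I'm currently doing outside lm-eval-harness (field names like `logit_diff` and the attribute keys are from my own rescued results, not from the harness itself): per demographic attribute, group the logit diffs by attribute value and report a mean per group rather than one overall number.

```python
from collections import defaultdict
from statistics import mean

def discrim_eval_summary(results):
    """Group logit diffs by each demographic attribute of the prompt
    variant they came from, and report a mean per attribute value.
    `results` is a list of dicts; the keys used here are illustrative."""
    per_attribute = defaultdict(lambda: defaultdict(list))
    for r in results:
        for attr in ("age", "gender", "race"):
            per_attribute[attr][r[attr]].append(r["logit_diff"])
    return {
        attr: {value: mean(diffs) for value, diffs in groups.items()}
        for attr, groups in per_attribute.items()
    }

results = [
    {"logit_diff": 0.5, "age": 30, "gender": "female", "race": "white"},
    {"logit_diff": 0.3, "age": 60, "gender": "male", "race": "black"},
]
print(discrim_eval_summary(results)["gender"])  # one mean per gender value
```

This is exactly the per-group breakdown that a single-number aggregation can't express, which is why the demographic metadata needs to survive until the aggregation step.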
YAML file:
utils.py file: