Instead of accepting an aggregation type, we should always return all available aggregations. It's probably also a good idea to automate the "pass@k" and majority metrics (this might need an opt-in and some light customization from each metric class to support it properly) and only ask each metric class to implement a "first" aggregation. That would make things simpler and reuse more code.
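A rough sketch of what this could look like, to make the proposal concrete. All names here (`BaseMetrics`, `supports_majority`, `is_correct`, `get_answer`, `ExactMatchMetrics`) are hypothetical and not taken from the existing code. The sketch assumes every question comes with the same number of samples, treats "pass@k" as "any of the first k samples is correct" (not the unbiased estimator), and breaks majority ties in favor of the answer seen first.

```python
from collections import Counter, defaultdict


class BaseMetrics:
    """Hypothetical base class: subclasses only say whether one sample is correct."""

    # Opt-in flag (assumption): enable only if get_answer() returns a hashable answer.
    supports_majority = False

    def __init__(self):
        self.total = 0
        self.correct = defaultdict(int)  # aggregation name -> number of correct questions

    def is_correct(self, prediction):
        """The only required hook; it defines the "first" aggregation."""
        raise NotImplementedError

    def get_answer(self, prediction):
        """Optional hook, needed only when supports_majority is True."""
        raise NotImplementedError

    def update(self, predictions):
        """Record all samples for one question and update every aggregation."""
        if not predictions:
            raise ValueError("need at least one prediction per question")
        self.total += 1
        flags = [bool(self.is_correct(p)) for p in predictions]
        self.correct["first"] += flags[0]
        for k in range(2, len(flags) + 1):
            # pass@k here: at least one of the first k samples is correct.
            self.correct[f"pass@{k}"] += any(flags[:k])
            if self.supports_majority:
                answers = [self.get_answer(p) for p in predictions[:k]]
                # most_common keeps first-seen order among equal counts.
                winner = Counter(answers).most_common(1)[0][0]
                self.correct[f"majority@{k}"] += flags[answers.index(winner)]

    def get_metrics(self):
        """Always return all available aggregations, as percentages."""
        if self.total == 0:
            return {}
        return {name: 100.0 * count / self.total for name, count in self.correct.items()}


class ExactMatchMetrics(BaseMetrics):
    """Example subclass: implements per-sample correctness and opts in to majority."""

    supports_majority = True

    def is_correct(self, prediction):
        return prediction["answer"] == prediction["expected"]

    def get_answer(self, prediction):
        return prediction["answer"]


metrics = ExactMatchMetrics()
metrics.update([
    {"answer": "3", "expected": "4"},
    {"answer": "4", "expected": "4"},
    {"answer": "4", "expected": "4"},
])
print(metrics.get_metrics())
```

With this shape, a metric class that cannot support majority voting leaves `supports_majority` off and still gets "first" and "pass@k" for free, and callers never have to pick an aggregation up front.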