Closed: OyvindTafjord closed this PR 2 months ago
Thanks Oyvind! Approving this PR.
Though it seems to me that not resetting label to 0 is actually fine for MMLU: MMLU's prep_examples(), inherited from ICLMultiChoiceTaskDataset, does not skip cases where label_id and cont_id mismatch when metric=bpb (whereas OEEvalTask does), and ICLMetric.compute() does not use label_id when metric=bpb. So questions with any label_id should already have been included in the metric computation.
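For readers skimming the thread, here is a minimal sketch of the two prep paths being contrasted. The class and method names (ICLMultiChoiceTaskDataset, OEEvalTask, prep_examples, label_id, cont_id, metric=bpb) come from the discussion above; the bodies are simplified assumptions, not the actual OLMo implementation.

```python
from typing import Any, Dict, List

# Simplified stand-in for one task document: continuations plus the gold index.
Doc = Dict[str, Any]


class ICLMultiChoiceTaskDataset:
    """Base path (inherited by MMLU): every continuation is kept."""

    def __init__(self, docs: List[Doc], metric_type: str = "bpb"):
        self.docs = docs
        self.metric_type = metric_type

    def prep_examples(self) -> List[Dict[str, int]]:
        examples = []
        for doc in self.docs:
            for cont_id, _ in enumerate(doc["continuations"]):
                # No skip here: even when metric_type == "bpb", rows where
                # cont_id != label_id are still emitted.
                examples.append({"cont_id": cont_id, "label_id": doc["label_id"]})
        return examples


class OEEvalTask(ICLMultiChoiceTaskDataset):
    """oe-eval path: for bpb, only the gold continuation survives."""

    def prep_examples(self) -> List[Dict[str, int]]:
        examples = []
        for doc in self.docs:
            for cont_id, _ in enumerate(doc["continuations"]):
                if self.metric_type == "bpb" and cont_id != doc["label_id"]:
                    continue  # skip non-gold continuations for bpb
                examples.append({"cont_id": cont_id, "label_id": doc["label_id"]})
        return examples
```

Under this reading, since ICLMetric.compute() never consults label_id on its bpb branch, the label value carried through the MMLU path is inert either way, so the reset is harmless but not strictly required there.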
This is the same problem as was fixed for oe-eval tasks in https://github.com/allenai/OLMo/pull/712; I forgot there was separate handling for MMLU.
With the previous code, only questions with gold answer A were counted in the bpb evaluations; now all questions should be counted.
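As a rough illustration of the failure mode (a hypothetical simplification, not the shipped code): if only the gold continuation is kept and re-indexed at position 0, a label that still carries its original index only matches when the gold answer was A.

```python
from typing import Dict, List

def bpb_rows_counted(examples: List[Dict[str, int]], reset_label: bool) -> int:
    """Count how many rows a label_id == cont_id match would keep.

    Illustrative helper, not part of OLMo.
    """
    counted = 0
    for ex in examples:
        label = 0 if reset_label else ex["label_id"]
        if ex["cont_id"] == label:
            counted += 1
    return counted

# With only the gold continuation kept (cont_id re-indexed to 0), an un-reset
# label matches only when the gold answer was A (label_id == 0):
examples = [{"cont_id": 0, "label_id": gold} for gold in (0, 1, 2, 3)]
assert bpb_rows_counted(examples, reset_label=False) == 1  # only gold-A questions
assert bpb_rows_counted(examples, reset_label=True) == 4   # all questions counted
```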