allenai / open-instruct


Possible unreliability of MMLU estimates #87

Closed Kong-Aobo closed 5 months ago

Kong-Aobo commented 7 months ago

From reading the source code, I believe the evaluation method for MMLU works as follows: the first character generated by the model is compared with the true label. I have a question about this. Although you append "Answer: " to the question, the model usually does not output the option letter directly. Below is an example of the related code.

    results = query_openai_chat_model(
        engine=args.openai_engine,
        instances=instances,
        batch_size=args.eval_batch_size if args.eval_batch_size else 10,
        output_path=os.path.join(args.save_dir, f"{subject}_openai_results.jsonl"),
        logit_bias={token_id: 100 for token_id in answer_choice_ids},
        max_tokens=1,   # Here: the model is limited to generating a single token
    )

Is my concern reasonable? Could you provide a relatively detailed explanation? It seems that, in the instruction-tuning setting, there is no very reliable way to extract answers without supervision.
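
To make the concern concrete, here is a minimal, hypothetical sketch of the first-token comparison I mean. The model, prompt, and gold label below are made up for illustration and are not the repository's actual evaluation code:

    # Hypothetical illustration of first-token answer extraction (not the repo's code).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM works for this illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = (
        "Question: What is the capital of France?\n"
        "A. Berlin\nB. Paris\nC. Rome\nD. Madrid\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate exactly one token, mirroring max_tokens=1 in the snippet above.
    output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    first_token = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:]).strip()

    gold_label = "B"
    # The score then hinges on whether this single token happens to be the option letter.
    print(first_token, first_token == gold_label)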

hamishivi commented 5 months ago

Hi - I think your concern is well-founded, and if you want an interesting read on the many ways you can implement MMLU and how those choices affect scores, I recommend this piece on the HF blog. Our implementation is designed to hew closely to the original MMLU repository and to match the numbers reported in the original Llama paper, but this is not necessarily the one "correct" way to do these things!

In the future, we hope to add more evaluations so that we can assess models from many different perspectives and avoid cases like this, where looking at a single number for one task with one prompt is unfair to some models.