allenai / open-instruct


Possible unreliability of MMLU estimates #87

Closed Kong-Aobo closed 5 months ago

Kong-Aobo commented 7 months ago

From reading the source code, I believe the evaluation method for MMLU works as follows: the first character generated by the model is compared with the true label. I have a question about this. Although you append "Answer: " to the question, the model usually does not output the option letter directly. Below is an example of the related code.

    results = query_openai_chat_model(
        engine=args.openai_engine,
        instances=instances,
        batch_size=args.eval_batch_size if args.eval_batch_size else 10,
        output_path=os.path.join(args.save_dir, f"{subject}_openai_results.jsonl"),
        logit_bias={token_id: 100 for token_id in answer_choice_ids},
        max_tokens=1,   # Here: the model is limited to generating a single token
    )

Is my concern reasonable? Could you provide a relatively detailed explanation? It seems that, in the instruction-tuning setting, there is no very reliable way to extract answers without supervision.
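
To make the concern concrete, here is a minimal, hypothetical sketch of the first-token comparison I mean. The model, prompt, and gold label below are made up for illustration and are not the repository's actual evaluation code:

    # Hypothetical illustration of first-token answer extraction (not the repo's code).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM works for this illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = (
        "Question: What is the capital of France?\n"
        "A. Berlin\nB. Paris\nC. Rome\nD. Madrid\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate exactly one token, mirroring max_tokens=1 in the snippet above.
    output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    first_token = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:]).strip()

    gold_label = "B"
    # The score then hinges on whether this single token happens to be the option letter.
    print(first_token, first_token == gold_label)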

hamishivi commented 5 months ago

Hi - I think your concern is well-founded, and if you want an interesting read on the many ways you can implement MMLU and how those choices affect scores, I recommend this piece on the HF blog. Our implementation is designed to hew closely to the original MMLU repository and to match the numbers reported in the original Llama paper, but this is not necessarily the one "correct" way to do these things!

In the future, we hope to add more evaluations so that we can assess models from many different perspectives and avoid cases like this, where looking at a single number for one task with one prompt is unfair to some models.