Open mathfinder opened 2 months ago
For initial testing, start with the 5-shot tasks (e.g., mmlu_stem_mc_5shot or mmlu_humanities_mc_5shot_test), which measure how well the model generalizes from a few examples. For less capable models, however, starting with multiple-choice (MC) tasks is not recommended, because such models may perform poorly on them. In that case, establish the model's baseline performance on simpler tasks first, then move on to the more demanding MC evaluations.
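The ordering described above can be sketched in a few lines of Python. This is only an illustration, not part of the library: the helper names `is_mc_5shot` and `staged_order` are hypothetical, and treating the labels without `_mc_5shot` as the "simpler" tasks is an assumption based purely on the label names listed in this issue.

```python
# Task labels copied from this issue; the grouping rule below is an assumption.
TASKS = [
    "mmlu_stem_test",
    "mmlu_stem",
    "mmlu_stem_var",
    "mmlu_stem_mc_5shot",
    "mmlu_humanities_mc_5shot",
    "mmlu_humanities_mc_5shot_test",
]


def is_mc_5shot(label: str) -> bool:
    """Hypothetical check: a label containing '_mc_5shot' is a 5-shot MC task."""
    return "_mc_5shot" in label


def staged_order(labels, less_capable: bool):
    """Return the labels in the order to try them.

    Assumption: for a less capable model the non-MC variants come first and
    the MC variants are deferred; otherwise the 5-shot MC tasks come first.
    """
    mc = [t for t in labels if is_mc_5shot(t)]
    other = [t for t in labels if not is_mc_5shot(t)]
    return other + mc if less_capable else mc + other


if __name__ == "__main__":
    for label in staged_order(TASKS, less_capable=True):
        print(label)
```

The resulting list would still have to be passed to whatever evaluation configuration you actually use; the sketch only makes the suggested order explicit.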
❓ The question
I found that you provide many MMLU test methods. Taking mmlu_stem as an example, these include mmlu_stem_test, mmlu_stem, mmlu_stem_var, mmlu_stem_mc_5shot, mmlu_humanities_mc_5shot, and mmlu_humanities_mc_5shot_test. Which one is recommended?