Thanks for open-sourcing this! I tested the repo with a small job and found that the 0-shot MMLU score stays at 0.22945449366187154 throughout training. I checked the dumped evaluation details and found that the negative log-likelihood is always 0 for every choice. For example:
{'doc_id': 0,
'doc': {'question': " Just war theory's principle of military necessity belongs to",
'subject': 'moral_disputes',
'choices': ['jus in bello.',
'jus ad bellum.',
'moral nihilism.',
'all of the above'],
'answer': 0},
'target': 0,
'arguments': [["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
' A'],
["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
' B'],
["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
' C'],
["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
' D']],
'resps': [[[0.0, True]], [[0.0, True]], [[0.0, True]], [[0.0, True]]],
'filtered_resps': [[0.0, True], [0.0, True], [0.0, True], [0.0, True]],
'doc_hash': 'bb0de79f79411c47783968714ec9fe3c69d89753e22c88f044420a7e00049a15',
'prompt_hash': 'c0f269e09cf44177328b16fb43734d6675416710a24286fea49dfad5663d2fb4',
'target_hash': '5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9',
'acc': 1.0}
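For reference, here is a minimal sketch of that check over the dumped samples (the glob pattern is only a placeholder; point it at wherever your run writes the per-sample JSONL files):

```python
import glob
import json

# Minimal sketch: count docs whose per-choice log-likelihoods are all exactly 0.0.
# The glob below is a placeholder; adjust it to the harness's per-sample dump location.
for path in glob.glob("output/samples_mmlu_*.jsonl"):
    total = zero = 0
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            # filtered_resps is a list of [loglikelihood, is_greedy] pairs, one per choice
            lls = [r[0] for r in doc["filtered_resps"]]
            total += 1
            zero += all(ll == 0.0 for ll in lls)
    print(f"{path}: {zero}/{total} docs with all-zero log-likelihoods")
```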
The other benchmarks look normal, e.g. piqa:
{'doc_id': 0,
'doc': {'goal': "How do I ready a guinea pig cage for it's new occupants?",
'sol1': 'Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.',
'sol2': 'Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.',
'label': 0},
'target': 0,
'arguments': [["Question: How do I ready a guinea pig cage for it's new occupants?\nAnswer:",
' Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.'],
["Question: How do I ready a guinea pig cage for it's new occupants?\nAnswer:",
' Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.']],
'resps': [[[-120.0, False]], [[-122.5, False]]],
'filtered_resps': [[-120.0, False], [-122.5, False]],
'doc_hash': 'ab177c9b9ad0fd48149e873e3d4804752991338a90c2072f52b975f86a7ca78e',
'prompt_hash': '14e2e90bdc64add59c76a88c2efb217a132b459947f9f2e3bbe8580b71beb533',
'target_hash': '5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9',
'acc': 1.0,
'acc_norm': 1.0}
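If I read the harness's multiple_choice scoring correctly (my understanding, not verified against the code), acc is just the argmax over the per-choice log-likelihoods, so with every choice scored 0.0 the prediction degenerates to always picking index 0, i.e. "A". The constant 0.22945... would then just be the fraction of MMLU questions whose gold answer happens to be A, not a real model score. A toy illustration:

```python
def pick(lls):
    # argmax over per-choice log-likelihoods; ties resolve to the first index, i.e. "A"
    return max(range(len(lls)), key=lambda i: lls[i])

lls_broken = [0.0, 0.0, 0.0, 0.0]   # the MMLU dump above (every doc looks like this)
lls_normal = [-120.0, -122.5]       # the piqa dump above, which looks fine

print(pick(lls_broken))  # 0 -> "correct" only when the gold answer happens to be A
print(pick(lls_normal))  # 0 -> an actual model preference
```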
Hey, thank you for the issue! Let me try to reproduce it and see what the problem is. We've only recently switched to lm-harness from our internal evals, so some things might be broken; sorry about that!
The saved eval config is
Could it be that MMLU's answers are only one token long and the eval script misses the last token? Looking forward to a fix.
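To make that guess concrete, here is a hypothetical illustration of the failure mode (not the actual eval code): if the continuation slice has an off-by-one, a one-token continuation like " A" produces an empty slice, and summing an empty slice silently yields 0.0, which matches the dump above; it could also explain why is_greedy comes back True for every choice.

```python
# Hypothetical failure mode (illustration only, not the actual eval code).
token_logprobs = [-2.3, -0.7, -1.1, -4.0]   # per-token log-probs over the scored sequence
cont_len = 1                                 # MMLU continuations like " A" are a single token

buggy = sum(token_logprobs[-cont_len:-1])    # empty slice when cont_len == 1 -> 0.0
fixed = sum(token_logprobs[-cont_len:])      # [-4.0] -> -4.0

print(buggy, fixed)                          # 0 -4.0
```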