facebookresearch / lingua

Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.

mmlu evaluation not working #28

Open zhengyang-wang opened 1 week ago

zhengyang-wang commented 1 week ago

Thanks for open-sourcing this! I tested the repo with a small job and found that the 0-shot MMLU score stays at 0.22945449366187154 throughout training. I checked the dumped evaluation details and found that the negative log-likelihood of every answer choice is always 0, so all choices tie and the constant score presumably just reflects how often the first choice happens to be correct. For example,

{'doc_id': 0,
 'doc': {'question': " Just war theory's principle of military necessity belongs to",
  'subject': 'moral_disputes',
  'choices': ['jus in bello.',
   'jus ad bellum.',
   'moral nihilism.',
   'all of the above'],
  'answer': 0},
 'target': 0,
 'arguments': [["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' A'],
  ["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' B'],
  ["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' C'],
  ["The following are multiple choice questions (with answers) about moral disputes.\n\nJust war theory's principle of military necessity belongs to\nA. jus in bello.\nB. jus ad bellum.\nC. moral nihilism.\nD. all of the above\nAnswer:",
   ' D']],
 'resps': [[[0.0, True]], [[0.0, True]], [[0.0, True]], [[0.0, True]]],
 'filtered_resps': [[0.0, True], [0.0, True], [0.0, True], [0.0, True]],
 'doc_hash': 'bb0de79f79411c47783968714ec9fe3c69d89753e22c88f044420a7e00049a15',
 'prompt_hash': 'c0f269e09cf44177328b16fb43734d6675416710a24286fea49dfad5663d2fb4',
 'target_hash': '5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9',
 'acc': 1.0}

The other benchmarks seem normal, e.g. PIQA:

{'doc_id': 0,
 'doc': {'goal': "How do I ready a guinea pig cage for it's new occupants?",
  'sol1': 'Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.',
  'sol2': 'Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.',
  'label': 0},
 'target': 0,
 'arguments': [["Question: How do I ready a guinea pig cage for it's new occupants?\nAnswer:",
   ' Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.'],
  ["Question: How do I ready a guinea pig cage for it's new occupants?\nAnswer:",
   ' Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.']],
 'resps': [[[-120.0, False]], [[-122.5, False]]],
 'filtered_resps': [[-120.0, False], [-122.5, False]],
 'doc_hash': 'ab177c9b9ad0fd48149e873e3d4804752991338a90c2072f52b975f86a7ca78e',
 'prompt_hash': '14e2e90bdc64add59c76a88c2efb217a132b459947f9f2e3bbe8580b71beb533',
 'target_hash': '5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9',
 'acc': 1.0,
 'acc_norm': 1.0}
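
For reference, here is a minimal sketch of how the dump can be scanned to confirm the pattern (the helper name is mine, and it assumes the per-sample records are available as dicts shaped like the ones above):

def all_loglikelihoods_zero(samples):
    # samples: iterable of per-document dicts like the ones above, where
    # 'filtered_resps' holds one [loglikelihood, is_greedy] pair per choice.
    return all(
        ll == 0.0
        for doc in samples
        for ll, _is_greedy in doc['filtered_resps']
    )

On my dump this returns True for the MMLU samples and False for PIQA and the other tasks.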

The saved eval config is

harness:
  tasks:
  - hellaswag
  - nq_open
  - piqa
  - winogrande
  - arc
  - race
  - mmlu
  num_fewshot: null
  device: null
  use_cache: null
  cache_requests: false
  rewrite_requests_cache: false
  delete_requests_cache: false
  limit: null
  bootstrap_iters: 100000
  check_integrity: false
  write_out: false
  log_samples: true
  system_instruction: null
  apply_chat_template: false
  fewshot_as_multiturn: false
  gen_kwargs: null
  verbosity: INFO
  predict_only: false
  random_seed: 0
  numpy_random_seed: 1234
  torch_random_seed: 1234
  fewshot_random_seed: 1234
wandb: null
global_step: 60000

It might be that MMLU's answers are only one token long and the eval script misses the last token? A toy illustration of what I suspect is happening is below. Looking forward to a fix.
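
Here is a toy illustration of the failure mode I suspect (the slicing is hypothetical, not the actual Lingua/harness code); it shows why a one-token continuation would come out as exactly 0.0 while longer continuations stay nonzero:

context_tokens = [101, 102, 103, 104]   # prompt ending with 'Answer:'
continuation_tokens = [105]             # the single token for ' A'
tokens = context_tokens + continuation_tokens

# Made-up per-token log-probs the model assigns to tokens[1:], so
# token_logprobs[i] scores tokens[i + 1].
token_logprobs = [-2.3, -1.7, -0.9, -4.2]

# Suspected off-by-one: the scoring window stops one position too early,
# so the last continuation token is never scored. With a one-token answer
# like ' A' the window is empty and sum([]) == 0.0, and all([]) is True,
# which would also explain the [0.0, True] entries in resps.
buggy = token_logprobs[len(context_tokens) - 1:len(tokens) - 2]
print(sum(buggy))    # 0.0

# Correct window: score every continuation token, including the last one.
correct = token_logprobs[len(context_tokens) - 1:len(tokens) - 1]
print(sum(correct))  # -4.2

With a multi-token continuation (as in PIQA) the buggy window only drops one token, so the log-likelihoods still look plausible.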

BadrYoubiIdrissi commented 1 week ago

Hey, thank you for the issue! Let me try to reproduce it and see what the problem is. We've only recently switched to lm-harness from our internal evals, so some things might be broken. Sorry about that!

tanishqkumar commented 5 days ago

+1