composable-models / llm_multiagent_debate

ICML 2024: Improving Factuality and Reasoning in Language Models through Multiagent Debate

A question about eval_mmlu.py #5

Open 1ittlesnow opened 1 year ago

1ittlesnow commented 1 year ago

In the `compute_accuracy` function in `eval_mmlu.py`, there is a line of code on line 86 that reads `if pred_answer is None: return 1`. However, if `pred_answer` is None, shouldn't the function return 0 instead of 1? Doesn't this indicate that the prediction is incorrect?
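For reference, a minimal sketch of what the corrected check could look like -- the function signature and the comparison below are assumptions for illustration, not the exact code in `eval_mmlu.py`:

```python
# Hypothetical sketch of the corrected check in compute_accuracy; the real
# function in eval_mmlu.py may parse and compare answers differently.
def compute_accuracy(gt_answer, pred_answer):
    # If no answer could be parsed from the model output, score it as incorrect.
    if pred_answer is None:
        return 0  # the original line 86 returned 1, counting a missing answer as correct
    # Otherwise score 1 for a match with the ground-truth answer, 0 for a mismatch.
    return 1 if pred_answer == gt_answer else 0
```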

yilundu commented 1 year ago

Hi -- thanks for catching that -- yes I will update the code soon. Essentially -- early on when testing MMLU, it seemed like pred_answer would sometimes be None. With later iterations of the prompt + multiagent debate, the answer was never None across agents, so I forgot about this line of the code.

1ittlesnow commented 12 months ago

> Hi -- thanks for catching that -- yes I will update the code soon. Essentially -- early on when testing MMLU, it seemed like pred_answer would sometimes be None. With later iterations of the prompt + multiagent debate, the answer was never None across agents, so I forgot about this line of the code.

By the way, I noticed that in the code you provided, only 100 examples are chosen for generation. However, in the paper, it is not clear whether the reported accuracy is based on the 100 MMLU examples used in the code or on all MMLU examples. Can you provide more information or context to clarify this?

yilundu commented 12 months ago

It's only evaluated on 100 examples because the debate procedure is a bit computationally expensive -- this is also discussed in Appendix A.2 of the paper.

1ittlesnow commented 12 months ago

Thank you so much.

1ittlesnow commented 12 months ago

> It's only evaluated on 100 examples because the debate procedure is a bit computationally expensive -- this is also discussed in Appendix A.2 of the paper.

Could you please provide the 100 examples used in your experiments for MMLU and GSM8K?

yilundu commented 12 months ago

I think the 100 questions should be automatically chosen if you run the code with the (fixed) seed
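In case it helps with reproducing the subset, here is a minimal sketch of how a fixed seed makes a 100-question sample deterministic -- the seed value and the stand-in question pool are assumptions, not the repo's exact loading code:

```python
import random

# Stand-in pool for the full MMLU question set; the repo's scripts load the real data.
all_questions = [f"question_{i}" for i in range(10000)]

random.seed(0)  # fixed seed (assumed value); any fixed seed gives a repeatable draw
subset = random.sample(all_questions, 100)  # the same 100 items on every run
```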

1ittlesnow commented 12 months ago

Thank you. I mean the questions along with ChatGPT's responses.

yilundu commented 12 months ago

Ahh I see -- I think this dropbox link should have some logs : https://www.dropbox.com/sh/6kq5ixfnf4zqk09/AABezsYsBhgg1IQAZ12yQ43_a?dl=0

1ittlesnow commented 12 months ago

Thank you very much. ^v^