1ittlesnow opened this issue 1 year ago
Hi -- thanks for catching that -- yes I will update the code soon. Essentially -- early on when testing MMLU, it seemed like pred_answer would sometimes be None. With later iterations of the prompt + multiagent debate, the answer was never None across agents, so I forgot about this line of the code.
By the way, I noticed that in the code you provided, only 100 examples are chosen for generation. However, in the paper, it is not clear whether the reported accuracy is based on the 100 MMLU examples used in the code or on all MMLU examples. Can you provide more information or context to clarify this?
It's only evaluated on 100 examples because the debate procedure is a bit computationally expensive -- this is also discussed in Appendix A.2 of the paper.
Thank you so much.
Could you please provide the 100 examples used in your experiments for MMLU and GSM8K?
I think the 100 questions should be automatically chosen if you run the code with the (fixed) seed
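A minimal sketch of why a fixed seed reproduces the same subset; the function name and dataset here are hypothetical, not the repo's actual loading code:

```python
import random

def select_questions(questions, n=100, seed=0):
    # A fixed seed makes the sample deterministic, so anyone running
    # the code gets the same n questions.
    rng = random.Random(seed)
    return rng.sample(questions, n)

pool = [f"q{i}" for i in range(1000)]
# Two runs with the same seed return the identical subset.
assert select_questions(pool) == select_questions(pool)
```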
Thank you. I meant the questions together with ChatGPT's responses.
Ahh I see -- I think this dropbox link should have some logs : https://www.dropbox.com/sh/6kq5ixfnf4zqk09/AABezsYsBhgg1IQAZ12yQ43_a?dl=0
Thank you very much. ^v^
In the `compute_accuracy` function in eval_mmlu.py, line 86 reads `if pred_answer is None: return 1`. However, if `pred_answer` is None, shouldn't the function return 0 instead of 1? Doesn't a None prediction indicate that the prediction is incorrect?
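A hedged sketch of the corrected check: the guard returns 0 so an unparsed answer counts as incorrect. The surrounding function is simplified here and only the `pred_answer` name follows eval_mmlu.py:

```python
def score_prediction(pred_answer, gt_answer):
    # An answer that could not be parsed should count as wrong (0),
    # not right (1) -- this is the fix discussed in the thread.
    if pred_answer is None:
        return 0
    return 1 if pred_answer == gt_answer else 0
```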