lxning closed this issue 2 months ago
I think this is because the fewshot context is not constructed properly. I have included the details in #2196.
@liewziqin you mentioned a fix in https://github.com/EleutherAI/lm-evaluation-harness/pull/2238, but the issue still persists because of the fewshot context, is that right?
@lintangsutawika Yes, I think the issue is not the regex pattern. Currently the fewshot samples in our prompt do not include the CoT context or the "answer is" phrase, both of which are essential for the regex to work. Sample of the current prompt:
The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.Question:\nAs of 2017, how many of the world’s 1-year-old children today have been vaccinated against some disease? *\nOptions:\nA. 30%\nB. 60%\nC. 10%\nD. 90%\nE. 80%\nF. 40%\nG. 100%\nH. 50%\nI. N/A\nJ. N/A\nAnswer: Let's think step by step. E\n\nQuestion:\nWhich one of the following items is an example of nonmaterial culture?\nOptions:\nA. A dove feather\nB. Dove symbol\nC. Dove body lotion\nD. Dove deodorant\nE. Dove soap\nF. Dove candy bar\nG. Dove conditioner\nH. A dove (bird).\nI. Dove chocolate\nJ. Dove shampoo\nAnswer: Let's think step by step. B\n\nQuestion:\nWhich of the following cases established the precedent that a defendant must be informed of the right to remain silent, the right to a lawyer, and protection from self-incrimination?\nOptions:\nA. Brown v. Board of Education\nB. Miranda v. Arizona\nC. Roe v. Wade\nD. Betts v. Brady\nE. Plessy v. Ferguson\nF. Dred Scott v. Sandford\nG. Weeks v. United States\nH. Gideon v. Wainwright\nI. Marbury v. Madison\nJ. Mapp v. Ohio\nAnswer: Let's think step by step. B\n\nQuestion:
As shown in the sample, the answer format in our fewshot samples is "Answer: Let's think step by step. E", so the model might not output its answer in the "answer is ([A-J])" format that the regex expects. Therefore I think we should correct the fewshot context construction as per #2196.
@liewziqin In the latest code, the input prompt seems to contain "the answer is xxx".
@eyuansu62 I see, thanks for letting me know. I haven't tried the latest code yet.
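To make the mismatch concrete, here is a minimal sketch of the extraction step. The pattern is modeled on the "answer is ([A-J])" regex mentioned in this thread and is an assumption, not necessarily the exact pattern the task uses; `extract_choice` is a hypothetical helper, and `<reasoning>` stands in for the CoT text.

```python
import re

# Assumed pattern, modeled on the "answer is ([A-J])" regex discussed above.
# The parentheses around the letter are treated as optional here.
ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?")


def extract_choice(text):
    """Return the extracted letter choice, or None if the pattern does not match."""
    match = ANSWER_RE.search(text)
    return match.group(1) if match else None


# Fewshot answer as currently built: no "answer is" phrase, so nothing is extracted.
current_style = "Answer: Let's think step by step. E"

# Fewshot answer with CoT context ending in the expected phrase.
cot_style = "Answer: Let's think step by step. <reasoning> The answer is (E)."

print(extract_choice(current_style))
print(extract_choice(cot_style))
```

A model that imitates the first format will produce responses the regex cannot parse, which is consistent with the low scores reported here.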
However, the latest code is generating unexpected blank spaces between each section, like this: 'The answer is (D).\n\n Question:'. I believe it would be better if there were no blank spaces between '\n\n' and 'Question'.
I've gone ahead and merged #2238, which also includes a fix related to the blank space after \n\n.
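For anyone who wants to check their own prompts for the stray space, here is a rough sketch of joining fewshot sections so that nothing follows the '\n\n' delimiter. `join_fewshot` is a hypothetical helper for illustration only and is not the code from #2238.

```python
def join_fewshot(sections, delimiter="\n\n"):
    """Join prompt sections, stripping stray whitespace at each section's edges."""
    return delimiter.join(section.strip() for section in sections)


# The second section carries the leading space described above.
sections = ["... The answer is (D).", " Question:"]
prompt = join_fewshot(sections)

print(repr(prompt))
```

Stripping each section before joining is one simple option; it assumes no section relies on leading or trailing whitespace.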
I got a score of 0 when I ran the following command. However, the response in the log_sample does contain "answer is (A)".
Command:
sample log