bin123apple / AutoCoder

We introduced a new model designed for the Code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024) and GPT-4o.
https://arxiv.org/abs/2405.14906
Apache License 2.0

What information is left out during inference from the output of model.generate()? #11

Closed KevinH48264 closed 3 months ago

KevinH48264 commented 3 months ago

Thank you for the great work and sharing this repo! One question: What information is left out during inference from the output of model.generate()?

For context, after we run model.generate, we get the final answer for the code generation.

However, from the fine-tuning dataset proposed in the AIEV-Instruct paper, AutoCoder is fine-tuned on a multi-turn dialogue dataset. Therefore, we expect the output of model.generate to be a multi-turn dialogue as well, but it isn't -- it appears to be a single-turn answer.

What information is left out? Does model.generate implicitly go through the process of 1) generating an initial answer, 2) having a code executor run the test cases, and then 3) giving feedback and posing a question if it didn't pass the test cases? And if so, how many rounds of this cycle are employed during model.generate?

Thank you again for the great work. :)

bin123apple commented 3 months ago

Thanks for your question. If the user asks the model to verify the generated code (for example, if you say something like: "Please help me to verify your code" or "Could you help to verify your code," etc.), it will automatically start to test the code.

The whole verification process will go through the process steps (1), (2), and (3) as you mentioned above, and it will stop when all the test cases pass, the maximum try limitation is reached, or the model thinks it cannot finish it.

This function is similar to GPT-4's code interpreter. It will only start the code verification when the user explicitly asks the model to do so. This provides more flexible options for the users.
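For illustration, a minimal sketch of what such a verify-and-retry loop could look like when wrapped around the model (the helper callables and the retry limit here are assumptions, not code from this repository):

```python
# Hypothetical sketch of the generate -> execute -> feedback cycle described
# above, with the three stop conditions: all tests pass, the retry limit is
# reached, or the model says it cannot finish. Not code from this repository.
from typing import Callable, Dict, List, Tuple

def verify_loop(
    question: str,
    generate_reply: Callable[[List[Dict[str, str]]], str],  # wraps model.generate()
    run_tests: Callable[[str], Tuple[bool, str]],            # external code executor
    max_tries: int = 5,                                      # assumed retry limit
) -> str:
    history = [{"role": "user", "content": question}]
    reply = ""
    for _ in range(max_tries):
        reply = generate_reply(history)                         # (1) model proposes code
        history.append({"role": "assistant", "content": reply})
        passed, feedback = run_tests(reply)                     # (2) executor runs the test cases
        if passed or "cannot finish" in reply.lower():          # stop conditions
            break
        history.append({"role": "user", "content": feedback})   # (3) feed results back and retry
    return reply
```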

KevinH48264 commented 3 months ago

Thanks for the response!

For more clarification, I'm most curious about the HumanEval+ evaluation script. When it calls model.generate, the prompt only says something like: "Write a solution to the following problem:\n python\n from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n """ Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n """\n"
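(For readability, that escaped prompt is just the instruction "Write a solution to the following problem:" followed by an unfinished function like this:)

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```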

This doesn't include "Please help me verify your code", and the output is simply the python code snippet, appearing as a single turn.

However, the fine-tuning dataset is multi-turn, yet we don't see multiple turns in the output of model.generate(), so we have no idea whether model.generate() is 1) truly just a single-turn Q&A with no executor-validated feedback, or 2) engaging in a multi-turn, executor-validated dialogue. Which one is model.generate() following for the HumanEval+ eval script?

bin123apple commented 3 months ago

Please see Section 5, first paragraph, of our paper. To ensure a fair comparison with other models and reduce experimental randomness, we disabled AutoCoder's external code interpreter during the tests and used greedy sampling. This means it is only a single-turn Q&A style here. Thanks!
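For reference, single-turn greedy generation of this kind looks roughly like the following with Hugging Face transformers; the checkpoint name and prompt handling are assumptions for illustration, not an excerpt from Evaluation/test_humaneval.py:

```python
# Illustrative sketch of single-turn greedy decoding; the checkpoint name and
# prompt handling are assumptions, not an excerpt from Evaluation/test_humaneval.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Bin12345/AutoCoder"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a solution to the following problem:\n..."  # one HumanEval+ task prompt
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# do_sample=False means greedy decoding: a deterministic, single-turn answer
# with no external code interpreter involved.
output_ids = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```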

KevinH48264 commented 3 months ago

Right, thanks for confirming that. Just a final clarification on expected vs. actual behavior in Evaluation/test_humaneval.py: could you help correct my understanding of what the actual behavior is?

Expected behavior after fine-tuning on multi-turn dialogue:

  1. User prompts a question
  2. Fine-tuned model generates an output
  3. Model generates predicted execution result
  4. Model generates a) questions and returns to step 2 if it predicts that at least one "test case will fail, the maximum try limitation is reached, or the model thinks it cannot finish it", OR b) feedback on predicted success on the test cases
  5. Program returns step 2 output

Current behavior after fine-tuning on multi-turn dialogue from running HumanEval+ script:

  1. User prompts a question
  2. Fine-tuned model generates an output
  3. Program returns step 2 output

So steps 3 and 4 are currently not found in the "output" variable after model.generate(). Which one is the actual behavior of model.generate()? And please correct either behavior if it's close but not completely accurate, as my understanding is currently limited.

If I had to guess, it seems like what's happening is the expected-behavior path, but relying on the model to predict execution results without a code interpreter is an interesting choice, and it's a little hard to see how a model can reliably predict success or failure on test cases without compiling and executing. If that is the case, it's pretty impressive and suggests a strong ability to predict test-case outcomes without running the tests -- but again, this seems difficult and a bit unlikely. And if so, I would expect performance to increase by allowing the code interpreter to run test cases during the HumanEval+ evaluation, since test-case results could be hallucinated without a code interpreter in the current format.

bin123apple commented 3 months ago

The execution result is NOT predicted by the model; it is generated by the program. During fine-tuning, the execution result is not compared with the ground truth to calculate the final loss. We want to teach the model how to analyze the execution result instead of generating the execution result on its own.

model.generate() works the same as it does for other models: it only generates a single-turn output. The code interpreter function of AutoCoder is implemented in Web_demo/chatbot.py

If you ask the model to verify its code, the whole process would be:

  1. User prompts a question
  2. Fine-tuned model generates an output
  3. Program returns the execution results of the step 2 output
  4. Model analyzes the execution results and provides further modifications
  5. If the execution results are good, the model provides the final code

For the HumanEval test in Evaluation/test_humaneval.py, AutoCoder does not include any code interpreter function. This is for fair comparison (because other models do not have this kind of external program during their testing on the HumanEval dataset).
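So on the evaluation side, the script only has to pull the code out of that single reply; a rough sketch of such an extraction step (an assumption about how it could be done, not an excerpt from Evaluation/test_humaneval.py) would be:

```python
# Hypothetical helper for pulling the code out of a single-turn reply;
# not code from Evaluation/test_humaneval.py.
import re

def extract_code(reply: str) -> str:
    """Return the first fenced Python block from a model reply, or the raw reply."""
    match = re.search(r"```python\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply
```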

Thanks!

KevinH48264 commented 3 months ago

Ok, just to make sure I'm understanding: for the HumanEval test in Evaluation/test_humaneval.py, AutoCoder will not include any code interpreter function. Q1: does that mean we just get the process you described above, but with steps 4 and 5 skipped, i.e. effectively a single-turn response?

  1. User prompts a question
  2. Fine-tuned model generates an output
  3. Program returns step 2 execution results -- (single turn)
  4. [SKIP] Model analyzes the execution results and provides further modifications
  5. [SKIP] If the execution results are good, the model provides the final code

The reason I was confused and seeking clarity is that I was worried fine-tuning on multi-turn dialogue would teach the model to first output "wrong" answers and then learn to correct them from execution results -- after all, since you're fine-tuning on multi-turn dialogue, you might expect it not to perform well on single-turn dialogues.

Therefore, it's a bit confusing to me how the fine-tuned model, without execution results, still performs well on single-turn dialogues like HumanEval when it was fine-tuned on multi-turn dialogues. I wouldn't expect the first single-turn answer to be correct, because the first output in the multi-turn dialogues it is trained on is usually not the correct final answer. Q2: Could you help clarify where I'm misunderstanding? 🙏

bin123apple commented 3 months ago

Good question! I also thought about this question while training the model.

For Q1:

In Evaluation/test_humaneval.py, step 3 is also skipped, which means it only contains steps 1 and 2.

For Q2:

I was also worried that the multi-turn dialogues would lead the model to a wrong answer, because the first-turn response is incorrect. Thus, I did some dataset cleaning after I got the raw multi-turn dataset. For example, for a portion of the data entries that contain several rounds of execution feedback, I only kept the code and the execution results from the last round (which means this is the correct code).

However, the other multi-turn dialogues were kept unchanged. This is because I found that if I changed all the data entries to a single round of execution feedback, the model would lose the ability to respond to wrong execution results.

If you don't care about the code interpreter ability, I think there should be no problem with only using the code from the last round of execution feedback.
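As a rough illustration, that cleaning step amounts to something like the following (the message layout here is an assumption; the real AIEV-Instruct schema may differ):

```python
# Rough sketch of the cleaning step: collapse a multi-round execution-feedback
# dialogue to its final (passing) round. The [user, assistant, user(execution
# feedback), assistant, ...] layout is an assumption, not the released schema.
from typing import Dict, List

def keep_last_round(dialogue: List[Dict[str, str]]) -> List[Dict[str, str]]:
    if len(dialogue) <= 3:
        return dialogue                      # already a single round, keep as is
    # keep the original question, the last execution feedback, and the final code
    return [dialogue[0]] + dialogue[-2:]
```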

Thanks!

KevinH48264 commented 3 months ago

Ah this makes a lot more sense, thank you for the clarification!

So 2 cases are supported by AIEV-Instruct: 1 is a single-turn answer, and 2 is a multi-turn dialogue.

  • For 1, fine-tuning operates as it traditionally does on Q&A pairs, for single-turn inference cases like HumanEval+.
  • For 2, fine-tuning operates on the multi-turn dialogue, but it seems like the initial question is appended with a variant of "Please help me verify your code" so that the model engages the multi-turn dialogue system, which signals that it's okay to output an initial wrong answer (otherwise, without appending to the initial question, there's a high risk of an initial wrong answer being output). So there's still a high chance that the single-turn inference is right, but by engaging the execution feedback and questioning cycle, there's an even higher chance the final answer is correct.

I suppose lastly I am wondering: for the HumanEval single-turn evaluation case, why does the AutoCoder model perform better than other supervised fine-tuning models if these models are effectively all fine-tuned on question + final correct code pairs? Even though there's multi-turn dialogue fine-tuning, the most applicable fine-tuning to HumanEval for AutoCoder is really the single-turn dialogue fine-tuning examples from the dataset cleaning -- at least that's what I currently think.

bin123apple commented 3 months ago

I think for other large code datasets, it is hard to make sure they are 100% correct. Most of them are annotated by using some large teacher models such as GPT-4. AIEV-Instruct tries to improve their accuracy by adding execution feedback and some unit tests.

It is also hard to say that AIEV-Instruct is 100% correct, because some test cases may not be covered by the generated unit tests. I think this is one of the reasons that AutoCoder performs well on HumanEval but not as well on HumanEval+.

KevinH48264 commented 3 months ago

Right, that makes sense. Thank you for all the clarifications of my misunderstandings; this is much clearer.

Thanks again for the great work!