3.4 Ablation Studies

(a)

Model         Method    Valid  Test
8B Instruct   –           8.9  10.5
8B Instruct   Few-Shot    8.5   8.5
8B Instruct   SFT        10.3  10.0
8B Instruct   RLEF       17.2  16.0
70B Instruct  –          25.9  27.5
70B Instruct  Few-Shot   22.5  20.3
70B Instruct  SFT        27.7  27.2
70B Instruct  RLEF       37.5  40.1

(b)

Model         Training  Valid ST  Valid MT  Test ST  Test MT
8B Instruct   –              9.4       8.9     11.6     10.5
8B Instruct   ST            10.3      10.2      9.9     10.9
8B Instruct   MT            16.2      17.2      9.5     16.0
70B Instruct  –             25.6      25.9     25.9     27.5
70B Instruct  ST            28.3      31.1     27.3     32.9
70B Instruct  MT            25.8      37.5     30.3     40.1

(c) 1@3 solve rates starting from Llama 3.1 models, temperature 0.2. (a) Comparison of different methods for acquiring iterative code synthesis capabilities. RLEF is the most effective training method, followed by supervised fine-tuning (SFT). We find few-shot prompting to be detrimental to Instruct models. (b) Conventional single-turn (ST) training compared to our multi-turn (MT) training with our RL loop. In each column pair, ST and MT denote single-turn and multi-turn inference. MT training yields larger improvements than ST training, and improvements from ST training carry over to multi-turn inference only for the 70B model.
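As a reading aid for panel (b), the following sketch computes, for each model and training regime, how much multi-turn inference gains over single-turn inference. The numbers are copied from panel (b); the dictionary layout and the helper name `mt_minus_st` are illustrative assumptions, not part of the original evaluation code.

```python
# Values copied from panel (b): (single-turn, multi-turn) inference solve rates.
# Keys are (model, training regime); "none" is the untrained Instruct baseline.
table_b = {
    ("8B", "none"): {"valid": (9.4, 8.9), "test": (11.6, 10.5)},
    ("8B", "ST"): {"valid": (10.3, 10.2), "test": (9.9, 10.9)},
    ("8B", "MT"): {"valid": (16.2, 17.2), "test": (9.5, 16.0)},
    ("70B", "none"): {"valid": (25.6, 25.9), "test": (25.9, 27.5)},
    ("70B", "ST"): {"valid": (28.3, 31.1), "test": (27.3, 32.9)},
    ("70B", "MT"): {"valid": (25.8, 37.5), "test": (30.3, 40.1)},
}


def mt_minus_st(scores):
    """Gain of multi-turn over single-turn inference, per split (hypothetical helper)."""
    return {split: round(mt - st, 1) for split, (st, mt) in scores.items()}


gaps = {key: mt_minus_st(scores) for key, scores in table_b.items()}

for (model, training), gap in gaps.items():
    print(f"{model:>3} training={training:<4} valid {gap['valid']:+.1f}  test {gap['test']:+.1f}")
```

The gap is largest for the MT-trained models, and for ST-trained models it is clearly positive on both splits only at 70B.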

3.4.1 Learning Iterative Code Synthesis

We investigate whether LLMs can, apart from our RL training, be effective in multi-turn code generation using few-shot prompting (Brown et al., 2020) and supervised fine-tuning (SFT). Lacking suitable ground truth training examples for SFT, we mine rollouts on the CodeContests training set with Llama 3.1 70B Instruct and filter them based on the correctness of final solutions. We then fine-tune Base and Instruct versions of the Llama 3.1 8B and 70B parameter models on the mined corpus and also source it for few-shot examples (Section A.3). The results in Section 3.4 show that few-shot prompting is detrimental to the instruction-tuned models. In Section B.1 we report few-shot 1@3 solve rates for pre-trained models and find that they achieve lower performance compared to zero-shot prompting for instruction models (1.2 and 1.8 for 8B, 4.6 and 5.8 for 70B on valid and test set, respectively). Supervised fine-tuning improves Instruct model performance on the validation set only; we do not see improvements on the test set. For pre-trained models, we see improvements from SFT but lower scores compared to instruction-tuned models (Section B.1). With RLEF we obtain significantly higher solve rates compared to SFT models, underscoring the efficacy of our RL training loop.
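The mining step described above, keeping only rollouts whose final solution is correct, can be sketched as follows. This is a minimal illustration under stated assumptions: the `Rollout` record, its field names, and `filter_rollouts` are hypothetical and do not reflect the actual data format or pipeline used for the SFT corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical record of one multi-turn dialog on a training problem."""
    problem_id: str
    responses: List[str]   # one model response (candidate solution) per turn
    final_passed: bool      # whether the last response passed the tests


def filter_rollouts(rollouts: List[Rollout]) -> List[Rollout]:
    """Keep rollouts that end in a correct solution; earlier turns may be wrong."""
    return [r for r in rollouts if r.responses and r.final_passed]


mined = [
    Rollout("p1", ["wrong attempt", "fixed attempt"], final_passed=True),
    Rollout("p2", ["wrong attempt", "still wrong"], final_passed=False),
    Rollout("p3", ["correct on first try"], final_passed=True),
]

sft_corpus = filter_rollouts(mined)
print([r.problem_id for r in sft_corpus])
```

Filtering on the final turn only means that retained dialogs can still contain failed intermediate attempts, which is what provides examples of repair behavior.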

3.4.2 Single-turn Training

In Section 3.4 we compare our iterative code generation setup to traditional, single-turn generation where the model is not presented with inference-time feedback. We use the same training loop for single generations, albeit without the penalty for invalid code (Section 2.2), as this is subsumed by the reward signal for incorrect solutions. For Llama 3.1 Instruct 8B, single-turn training (ST) hurts performance on the test set. The 70B model benefits from single-turn training and improves over the multi-turn SFT results in Section 3.4. Moreover, we observe transfer in that applying the model trained for single turns in a multi-turn setting improves 1@3 solve rates. We attribute this to the existing but comparably weak multi-turn capabilities of the vanilla 70B Instruct model. Overall, we see the strongest performance with the RLEF method employing multiple turns at training and inference time.
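The difference between the two inference settings can be illustrated with a toy loop: with a turn limit of one, the model never sees feedback, while with a larger limit each failed attempt produces a textual message that conditions the next attempt. This is a sketch under stated assumptions, not the paper's implementation. The names `run_tests`, `rollout`, and `toy_policy`, the `solve` entry point, and the feedback format are all hypothetical, and `exec` is used here without any sandboxing purely for brevity.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[int, int]  # (input, expected output) for a hypothetical solve(x)


def run_tests(code: str, tests: List[TestCase]) -> Tuple[bool, str]:
    """Execute candidate code and return (passed, textual feedback).

    Assumption: the candidate defines a function named `solve`.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # NOT sandboxed; for illustration only
        for arg, expected in tests:
            got = namespace["solve"](arg)
            if got != expected:
                return False, f"solve({arg!r}) returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all tests passed"


def rollout(
    generate: Callable[[Optional[str]], str],
    tests: List[TestCase],
    max_turns: int,
) -> Tuple[bool, int]:
    """Run up to max_turns attempts; max_turns=1 is the single-turn setting."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code = generate(feedback)          # the model sees the previous feedback, if any
        passed, feedback = run_tests(code, tests)
        if passed:
            return True, turn
    return False, max_turns


def toy_policy(feedback: Optional[str]) -> str:
    """Stand-in for an LLM: buggy first attempt, corrected once feedback arrives."""
    if feedback is None:
        return "def solve(x):\n    return x + 1\n"
    return "def solve(x):\n    return x * 2\n"


tests = [(2, 4), (3, 6)]
single_turn = rollout(toy_policy, tests, max_turns=1)
multi_turn = rollout(toy_policy, tests, max_turns=3)
print(single_turn, multi_turn)
```

In this toy case the single-turn rollout fails, while the multi-turn rollout succeeds on its second turn because the policy reacts to the feedback message.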

4 Related Work

Generating program code with LLMs to automate and assist software development has been studied extensively in recent years, with evaluations predominantly focusing on code synthesis from natural language descriptions (Clement et al., 2020; Chen et al., 2021; Austin et al., 2021). A major boost in performance is obtained by including large quantities of source code in pre-training and selecting or generating suitable data for subsequent fine-tuning for instruction following (Li et al., 2023; Gunasekar et al., 2023; Rozière et al., 2023; AI @ Meta, 2024).

More recently, several works investigated prompting and flow engineering techniques to improve performance at inference time, including the verification of generated code via compilation and execution, followed by re-prompting. Shinn et al. (2023) and Chen et al. (2024) use feedback from unit tests to correct previously wrong generations and found it crucial to include model-generated error analysis in the prompt for successive generations. LDB (Zhong et al., 2024), AlphaCodium (Ridnik et al., 2024) and MapCoder (Islam et al., 2024) can be regarded as agentic frameworks as they provide rich manual scaffolding for code generation, chaining several LLM calls (e.g., for chain-of-thought planning, test generation, and program repair) combined with code execution. These approaches are effective on difficult benchmarks, such as the CodeContests dataset we consider in this work, but significantly increase inference cost by requiring dozens of LLM calls per solution.

Recent works highlight further issues with scaffolds like AlphaCodium or MapCoder. Olausson et al. (2024) show that sampling code solutions independently is competitive with repairing faulty code, that large models are required to provide effective feedback on errors, and that multiple rounds of repair are not effective. Kapoor et al. (2024) focus on inference cost and demonstrate that independent sampling beats the approaches from Shinn et al. (2023) and Zhong et al. (2024) when considering equal sampling budgets. With our method, the self-repair capabilities of LLMs can be dramatically enhanced, resulting in superior performance of iterative code generation for both small and large sample budgets. At the same time, we propose to trade complex, domain-specific prompt engineering and scaffolding for domain-specific fine-tuning.

Fine-tuning large language models with reinforcement learning is a popular method for aligning their output to user preferences (Ziegler et al., 2020; Touvron et al., 2023; OpenAI, 2023; DeepSeek-AI et al., 2024; AI @ Meta, 2024). Here, the learning signal is provided by special-purpose reward models. For code synthesis, however, rewards can be determined by executing LLM generations against available test cases. Le et al. (2022) pre-train an LLM for code generation and subsequently fine-tune it with both policy gradients and next-token loss on rewards from execution. For policy rollouts, they perform program refinement and repair to increase the likelihood of sampling correct programs; however, their model is trained on final solutions only. Shojaee et al. (2023) utilize a rich reward signal that considers execution as well as similarity to ground truth code, introducing a dependency on human-provided solutions. Finally, Xu et al. (2024) fine-tune a stronger, code-specific LLM in a simpler setup with a binary reward from unit tests and observe substantial improvements from RL on the difficult competitive programming benchmark we consider here. We likewise propose a simple setting without extra inference scaffolding or usage of ground truth solutions. Crucially, we expand the traditional natural-language-to-code setting to an iterative environment where execution feedback is not only provided as a scalar reward but also in textual form. This allows us to shift focus from large-sample inference regimes to obtaining high accuracy with low sample budgets. Concurrently with our work, Kumar et al. (2024) propose a two-stage RL method (SCoRe) to improve the self-correction capabilities of LLMs and train them to output two successive solutions. In contrast to our method, SCoRe does not leverage execution feedback at inference time and instead asks the model to reconsider its initial solution. While this approach allows for potential applications to domains where automatic feedback is not available, it cannot benefit from the information provided in the feedback message. Furthermore, inference-time feedback can help the model generalize to new environments after training.

Past work on applying reinforcement learning to LLMs on longer-horizon decision-making tasks placed an emphasis on acquiring the necessary grounding in the environment. Carta et al. (2023) report that RL tuning with PPO (Schulman et al., 2017) is superior to supervised training for grounding in text-based navigation games as measured by successful task completions. Zhou et al. (2024) propose a family of RL algorithms for LLMs and test them in text games (versus an oracle LLM) and for buying products using a simplified web shop API, and Zhai et al. (2024) tackle environments with visual observations, adapting the parameters of a pre-trained vision LLM. While our work follows similar motivations, we address a fundamentally different domain – code synthesis – which features a significantly larger action space compared to previous work, i.e., the space of valid Python programs.

5 Conclusion

In this work, we proposed reinforcement learning from execution feedback (RLEF), a fine-tuning method for LLMs that endows them with a crucial capability for autonomous operation: grounding future generations in environment feedback. We applied RLEF to iterative code synthesis and obtained substantial improvements in solve rates on the CodeContests competitive programming benchmark while reducing the required sample budget for inference. The RLEF-trained models further generalize to increased turn limits and to HumanEval+ and MBPP+, two popular code generation benchmarks that exhibit simpler programming questions and different execution feedback formatting. Our in-depth analysis revealed that, while an increase in correct first-turn generations and in the diversity of successive generations contributes substantially to performance, our models also meaningfully take execution feedback into account and resolve errors over multiple turns.

Limitations.

While our results demonstrate effective usage of inference-time feedback, the code synthesis task we consider is limited to improving a single solution to a given problem. Generalizing our method to environments with larger tasks that require decomposition, either via manual scaffolding or, eventually, in a self-directed manner, remains the subject of further research.

Broader Impact.

Successful grounding of LLMs for code generation in execution feedback will amplify their utility when applied to impactful tasks such as assisting software development and performing quality control. In general, however, increasing the capabilities of LLMs, now widely deployed in a range of applications, requires quality control and guard-railing to promote safety and minimize potentially harmful output. We limit our study to the generation of source code, where we confine the execution of model-generated output to local sandboxes. We believe the framework of Shavit et al. (2023) regarding the governance of AI agents to be a useful resource for practitioners.

Reproducibility Statement.

We perform all experiments with publicly available models and datasets. Section 3.1 describes the dataset and pre-processing steps, the exact Llama model versions used, and details our evaluation metric. The loss function and hyper-parameters for training, as well as a description of the compute infrastructure can be found in Section A.1. Section A.3 describes (narrow) hyper-parameter ranges for supervised fine-tuning, and Section A.2 contains notes regarding code execution during training and evaluation. All prompts are listed in Appendix C.

Acknowledgements.

We thank Quentin Carbonneaux, Chris Cummins, Olivier Duchenne, Fabian Gloeckle, Baptiste Roziere, Sten Sootla, Nicolas Usunier, and Sida Wang for helpful technical contributions, suggestions, and insightful discussions.

References

AI @ Meta (2024)
