ManifoldRG / Manifold-KB

This repository is a knowledge base that collects key insights, details from other research, and implementation references in one place, documenting the various possible paths to a given goal.

AF Survey - Reflexion: Language Agents with Verbal Reinforcement Learning #23

Closed: pranavguru closed this issue 9 months ago

pranavguru commented 9 months ago

Paper title: Reflexion: Language Agents with Verbal Reinforcement Learning (link to paper)

Estimated time to complete the review: by 09/22/23

If you are new to Manifold, here are some helpful links:

pranavguru commented 9 months ago

Reflexion: Language Agents with Verbal Reinforcement Learning

Review author: Pranav Guruprasad

Summary:

The authors of this paper propose Reflexion, a framework for language-based agents that converts the binary or scalar feedback an agent receives from its environment into verbal feedback in the form of a textual summary. The framework aims to reinforce language agents without updating model weights: agents augmented with Reflexion reflect on their task feedback signals, maintain a verbal version of those signals in an episodic memory buffer, and use that buffer to improve decision making in subsequent trials.

The authors develop Reflexion as a modular framework consisting of 3 distinct modules:

- an Actor, which generates text and actions conditioned on the current state and memory;
- an Evaluator, which scores the trajectory produced by the Actor;
- a Self-Reflection model, which converts the Evaluator's signal into verbal feedback.

Another core component of the Reflexion process is memory, which is split into short-term and long-term memory: the recent trajectory serves as short-term memory, while the outputs of the Self-Reflection model are stored in long-term memory. During inference, the Actor conditions its decisions on both components, which together provide context shaped by lessons learned over several trials.
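As a rough sketch of how the two memory components might be combined into the Actor's context (this is illustrative, not the authors' implementation; `build_actor_context` and its parameters are hypothetical names):

```python
# Sketch of Reflexion-style memory, assuming the Actor is prompted with the
# recent trajectory (short-term) plus stored self-reflections (long-term).
# All names here are illustrative, not the paper's API.
MAX_REFLECTIONS = 3  # long-term memory is assumed bounded to a few entries

def build_actor_context(task: str, trajectory: list[str], reflections: list[str]) -> str:
    """Combine short-term and long-term memory into one prompt for the Actor."""
    reflection_block = "\n".join(reflections[-MAX_REFLECTIONS:])
    trajectory_block = "\n".join(trajectory)
    return (
        f"Task: {task}\n"
        f"Lessons from past trials:\n{reflection_block}\n"
        f"Current trajectory:\n{trajectory_block}\n"
        "Next action:"
    )
```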

The Reflexion process consists of 3 steps:

1. The Actor produces a trajectory by interacting with the environment.
2. The Evaluator scores the trajectory with a binary or scalar reward.
3. The Self-Reflection model converts this sparse reward into a verbal summary, which is appended to the agent's long-term memory.

The above 3 steps are carried out iteratively until the Evaluator deems the trajectory generated by the Actor to be correct; a minimal sketch of this loop appears below.
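A minimal sketch of the loop, assuming `actor`, `evaluator`, and `self_reflect` stand in for LLM calls (these names and the trial budget are assumptions, not the paper's interfaces):

```python
# Hypothetical Reflexion loop: iterate until the evaluator accepts the
# trajectory or a trial budget is exhausted.
def reflexion_loop(task, actor, evaluator, self_reflect, max_trials=5):
    reflections = []  # long-term episodic memory of verbal feedback
    trajectory = []
    for _ in range(max_trials):
        trajectory = actor(task, reflections)            # step 1: generate a trajectory
        success, feedback = evaluator(task, trajectory)  # step 2: score it
        if success:
            break
        # step 3: convert sparse feedback into a verbal lesson and store it
        reflections.append(self_reflect(task, trajectory, feedback))
    return trajectory
```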

The authors show that Reflexion agents improve over the baseline by 22% on AlfWorld decision-making tasks, by 20% on HotPotQA reasoning questions, and by up to 11% on HumanEval Python programming tasks, demonstrating that Reflexion achieves improvements over strong baselines across diverse tasks and state-of-the-art results on several of them. They also introduce LeetcodeHardGym, a code-generation RL gym environment consisting of 40 Leetcode 'hard' questions in 19 programming languages.

Motivation:

Experiments and Results:

The authors evaluate natural language RL agents on decision-making, reasoning, and code-generation tasks: search-based question answering on HotPotQA, where Reflexion improves over the baseline by 20%; multi-step tasks in household environments in AlfWorld, where it improves by 22%; and code-writing tasks in competition-like environments with interpreters and compilers in HumanEval, where it improves by 11%.

Limitations:

Significance:

Future work:

Related work:

Paper link: Reflexion: Language Agents with Verbal Reinforcement Learning

PranayPasula commented 9 months ago

Closed with merging of Reflexion_review_PranavGuru