bfaught3 opened 10 months ago
**Summary**
Large language models (LLMs) and vision-language models (VLMs) are used in conjunction with reinforcement learning (RL) agents via a framework that uses language as the core reasoning tool. The RL challenges addressed include efficient exploration, reuse of experience data, skill scheduling, and learning from observations, all of which would otherwise require separate, vertically designed algorithms. The method is tested in a sparse-reward simulated robotic manipulation environment built on the MuJoCo physics simulator, in which a robot arm interacts with a red, a blue, and a green object in a basket. The problem is formalized as a Markov Decision Process (MDP), where (a minimal interface sketch follows this list):
- the state space represents the 3D positions of the objects and of the end-effector
- the action space consists of an (x, y) position (reached through inverse kinematics) at which the robot arm either picks up or places an object
- the observation space consists of images taken from two cameras fixed to the edges of the basket
- the agent receives a language description of the task
- a positive reward is given only if the episode is successful.
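Below is a minimal sketch of how this MDP could be expressed as a small environment interface. Everything here is an illustrative assumption (class and field names such as `StackingEnv`, `Observation`, and `Action`, the image shapes, and the discrete pick/place flag), not code from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    front_camera: np.ndarray   # RGB image from one basket-edge camera, e.g. (128, 128, 3)
    side_camera: np.ndarray    # RGB image from the second fixed camera
    task_description: str      # language description of the task, e.g. "stack red on blue"

@dataclass
class Action:
    xy: np.ndarray             # target (x, y) position, reached via inverse kinematics
    pick: bool                 # True = pick up at (x, y), False = place at (x, y)

class StackingEnv:
    """Sparse-reward MuJoCo manipulation environment (interface sketch only)."""

    objects = ("red", "blue", "green")

    def reset(self, task_description: str) -> Observation:
        ...  # randomize object poses in the basket and return the first observation

    def step(self, action: Action) -> tuple[Observation, float, bool]:
        ...  # reward is sparse: 1.0 only when the task is completed, otherwise 0.0
```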
The framework is designed so that agents (illustrative sketches of each of these components follow the list):
- map visual inputs to text descriptions: done via CLIP, a large contrastive vision-language model
- prompt an LLM with these textual descriptions and a description of the task to produce language instructions: the LLM used is FLAN-T5, fine-tuned on datasets of language instructions; it is given a description of the environment setting and asked to find sub-goals that would lead to solving the proposed task, and two examples of such tasks with their sub-goal decompositions are sufficient for it to emulate the desired behavior
- ground the output of the LLM into actions: done via a language-conditioned policy network, parameterized as a transformer, that takes an embedding of the language sub-goal and the state of the MDP at the current timestep as input and outputs an action for the robot to execute at the next timestep; the network is trained from scratch in an RL loop
- learn from interaction with the environment: done via a method inspired by the Collect & Infer paradigm
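For the first component, here is a hedged sketch of scoring candidate scene descriptions against a camera image with CLIP. The checkpoint name and the candidate captions are assumptions for illustration; the paper fine-tunes its own VLM on domain data rather than using this exact off-the-shelf setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; not the fine-tuned model from the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def describe_scene(image: Image.Image, candidates: list[str]) -> str:
    """Return the candidate caption that CLIP scores highest for the given image."""
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)   # shape: (1, num_candidates)
    return candidates[probs.argmax().item()]

# Hypothetical captions describing object relations in the basket.
captions = [
    "the red object is on top of the blue object",
    "the blue object is on top of the red object",
    "no objects are stacked",
]
# description = describe_scene(Image.open("front_camera.png"), captions)
```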
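For the second component, a sketch of few-shot prompting an instruction-tuned LLM to decompose a task into sub-goals. The checkpoint variant and the exact prompt wording are assumptions; the paper uses FLAN-T5, but not necessarily this size or this prompt.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative FLAN-T5 variant; the paper's exact model size may differ.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
llm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Two worked examples of task -> sub-goal decomposition, as described above.
FEW_SHOT_PROMPT = """You control a robot arm over a basket containing a red, a blue, and a green object.
Decompose the task into pick-and-place sub-goals.

Task: stack the blue object on the red object.
Sub-goals: pick up the blue object; place the blue object on the red object.

Task: stack the green object on the blue object.
Sub-goals: pick up the green object; place the green object on the blue object.

Scene: {scene}
Task: {task}
Sub-goals:"""

def propose_subgoals(scene: str, task: str) -> list[str]:
    """Ask the LLM for a semicolon-separated list of sub-goals."""
    prompt = FEW_SHOT_PROMPT.format(scene=scene, task=task)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = llm.generate(input_ids, max_new_tokens=64)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return [g.strip() for g in text.split(";") if g.strip()]
```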
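For grounding sub-goals into actions, a toy version of a language-conditioned transformer policy. All dimensions, layer counts, and the discretized action head are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Toy policy: (sub-goal embedding, MDP state) -> logits over discretized actions."""

    def __init__(self, lang_dim=512, state_dim=12, d_model=128, n_actions=400):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, d_model)    # project the sub-goal embedding
        self.state_proj = nn.Linear(state_dim, d_model)  # project the current MDP state
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)  # e.g. a discretized (x, y) grid

    def forward(self, lang_emb: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # Two-token sequence: [sub-goal token, state token].
        tokens = torch.stack([self.lang_proj(lang_emb), self.state_proj(state)], dim=1)
        encoded = self.encoder(tokens)
        return self.action_head(encoded[:, -1])
```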
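Finally, a sketch of a Collect & Infer style loop: experience is gathered into a buffer under the current policy, then the policy is updated offline from everything collected so far. The `policy.act` and `policy.loss` methods and the uniform sampler are placeholders, not the paper's training procedure.

```python
import random

buffer = []  # shared experience buffer of (obs, action, reward, next_obs, done) tuples

def collect(env, policy, task, n_episodes=100):
    """Collect phase: roll out the current policy and store every transition."""
    for _ in range(n_episodes):
        obs, done = env.reset(task), False
        while not done:
            action = policy.act(obs)                  # placeholder action-selection method
            next_obs, reward, done = env.step(action)
            buffer.append((obs, action, reward, next_obs, done))
            obs = next_obs

def sample(batch_size=64):
    """Uniformly sample a mini-batch from the buffer."""
    return random.sample(buffer, min(batch_size, len(buffer)))

def infer(policy, optimizer, n_updates=1000):
    """Infer phase: learn from stored experience, independent of how it was collected."""
    for _ in range(n_updates):
        batch = sample()
        loss = policy.loss(batch)   # placeholder, e.g. a Q-learning or BC objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```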
**Motivations**
The size and abilities of LLMs and VLMs have given rise to the term "foundation models", since they can be reused for many different downstream applications. This, together with the observation that these models exhibit common-sense reasoning, the ability to propose and sequence sub-goals, visual understanding, and other properties characteristic of agents that interact with and learn from environments, motivates using them to bootstrap the process of designing such agents.
**Experiments**
To test exploration and curriculum generation through language, the agent was compared against a baseline agent on tasks in the simulated environment. The challenges investigated were those listed above: efficient exploration, reuse of experience data, skill scheduling, and learning from observations.
**Limitations**
The results are intended as a proof of concept for applying LLMs and VLMs to RL, and therefore do not represent an exhaustive survey of LLM/VLM-RL integration. The agent was also compared only against a baseline agent rather than against any SOTA agent, which prevents a direct comparison of LLM/VLM-based RL agents with SOTA RL techniques, and there are no SOTA metrics to compare against.
**Significance**
This agent demonstrates applications of LLMs and VLMs to RL problems. While the results are not compared against SOTA for similar tasks, they are nonetheless striking and demonstrate intuitive advantages of a foundation-model approach. Reward sparseness and the lack of intrinsic rewards can be overcome by sub-goal decomposition, and these sub-goals, together with the LLM/VLM framework, can be used to bootstrap learning on new but similar tasks. The results run counter to the general trend in RL, where the number of steps needed to learn a task typically grows quickly as rewards become sparser.
**Future Work**
The researchers have listed a number of future directions particular to their implementation, such as generalizing the state and action spaces of the MDP, fine-tuning CLIP in a more general way, and testing the framework in real-world environments rather than only in simulation. For our purposes, we can also investigate comparisons between these approaches and SOTA RL approaches, and their efficacy in our intended environments.
Paper Link: Towards a Unified Agent with Foundation Models
Hey @bfaught3 thanks for creating an issue for this!
Could you please add a description for the issue in your first post and include the points described in the community guidelines here: https://docs.google.com/document/d/1LPCl8ivbPQsEx96sGBPeCY7AM8PWh7RUcy-TNNJkQ50/edit#heading=h.dwwjnh416i3
Thanks in advance!