bfaught3 opened 10 months ago
**Summary**
Large language models (LLMs) and vision-language models (VLMs) are used in conjunction with reinforcement learning (RL) agents via a framework that uses language as the core reasoning tool. The RL challenges addressed include efficient exploration, reuse of experience data, skill scheduling, and learning from observations, all of which would otherwise require separate, vertically designed algorithms. The method is tested in a sparse-reward simulated robotic manipulation environment built on the MuJoCo physics simulator, in which a robot arm interacts with a red, a blue, and a green object in a basket. The problem is formalized as a Markov Decision Process (MDP), where (a minimal interface sketch follows this list):
- the state space represents the 3D positions of the objects and of the end-effector
- the action space consists of an (x, y) position (reached through inverse kinematics) at which the robot arm either picks up or places an object
- the observation space consists of images taken from two cameras fixed to the edges of the basket
- the agent receives a language description of the task
- a positive reward is given only if the episode is successful.
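Below is a minimal sketch of how this MDP could be expressed as a small environment interface. Everything here is an illustrative assumption (class and field names such as `StackingEnv`, `Observation`, and `Action`, the image shapes, and the discrete pick/place flag), not code from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    front_camera: np.ndarray   # RGB image from one basket-edge camera, e.g. (128, 128, 3)
    side_camera: np.ndarray    # RGB image from the second fixed camera
    task_description: str      # language description of the task, e.g. "stack red on blue"

@dataclass
class Action:
    xy: np.ndarray             # target (x, y) position, reached via inverse kinematics
    pick: bool                 # True = pick up at (x, y), False = place at (x, y)

class StackingEnv:
    """Sparse-reward MuJoCo manipulation environment (interface sketch only)."""

    objects = ("red", "blue", "green")

    def reset(self, task_description: str) -> Observation:
        ...  # randomize object poses in the basket and return the first observation

    def step(self, action: Action) -> tuple[Observation, float, bool]:
        ...  # reward is sparse: 1.0 only when the task is completed, otherwise 0.0
```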
The framework is designed so that agents (illustrative sketches of each of these components follow the list):
- map visual inputs to text descriptions: done via CLIP, a large contrastive vision-language model
- prompt an LLM with these textual descriptions and a description of the task to produce language instructions: the LLM used is FLAN-T5, fine-tuned on datasets of language instructions; it is given a description of the environment setting and asked to find sub-goals that would lead to solving the proposed task, and two examples of such tasks with their sub-goal decompositions are sufficient for it to emulate the desired behavior
- ground the output of the LLM into actions: done via a language-conditioned policy network, parameterized as a transformer, that takes an embedding of the language sub-goal and the state of the MDP at the current timestep as input and outputs an action for the robot to execute at the next timestep; the network is trained from scratch in an RL loop
- learn from interaction with the environment: done via a method inspired by the Collect & Infer paradigm
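For the first component, here is a hedged sketch of scoring candidate scene descriptions against a camera image with CLIP. The checkpoint name and the candidate captions are assumptions for illustration; the paper fine-tunes its own VLM on domain data rather than using this exact off-the-shelf setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; not the fine-tuned model from the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def describe_scene(image: Image.Image, candidates: list[str]) -> str:
    """Return the candidate caption that CLIP scores highest for the given image."""
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)   # shape: (1, num_candidates)
    return candidates[probs.argmax().item()]

# Hypothetical captions describing object relations in the basket.
captions = [
    "the red object is on top of the blue object",
    "the blue object is on top of the red object",
    "no objects are stacked",
]
# description = describe_scene(Image.open("front_camera.png"), captions)
```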
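For the second component, a sketch of few-shot prompting an instruction-tuned LLM to decompose a task into sub-goals. The checkpoint variant and the exact prompt wording are assumptions; the paper uses FLAN-T5, but not necessarily this size or this prompt.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative FLAN-T5 variant; the paper's exact model size may differ.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
llm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Two worked examples of task -> sub-goal decomposition, as described above.
FEW_SHOT_PROMPT = """You control a robot arm over a basket containing a red, a blue, and a green object.
Decompose the task into pick-and-place sub-goals.

Task: stack the blue object on the red object.
Sub-goals: pick up the blue object; place the blue object on the red object.

Task: stack the green object on the blue object.
Sub-goals: pick up the green object; place the green object on the blue object.

Scene: {scene}
Task: {task}
Sub-goals:"""

def propose_subgoals(scene: str, task: str) -> list[str]:
    """Ask the LLM for a semicolon-separated list of sub-goals."""
    prompt = FEW_SHOT_PROMPT.format(scene=scene, task=task)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = llm.generate(input_ids, max_new_tokens=64)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return [g.strip() for g in text.split(";") if g.strip()]
```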
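For grounding sub-goals into actions, a toy version of a language-conditioned transformer policy. All dimensions, layer counts, and the discretized action head are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Toy policy: (sub-goal embedding, MDP state) -> logits over discretized actions."""

    def __init__(self, lang_dim=512, state_dim=12, d_model=128, n_actions=400):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, d_model)    # project the sub-goal embedding
        self.state_proj = nn.Linear(state_dim, d_model)  # project the current MDP state
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)  # e.g. a discretized (x, y) grid

    def forward(self, lang_emb: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # Two-token sequence: [sub-goal token, state token].
        tokens = torch.stack([self.lang_proj(lang_emb), self.state_proj(state)], dim=1)
        encoded = self.encoder(tokens)
        return self.action_head(encoded[:, -1])
```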
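Finally, a sketch of a Collect & Infer style loop: experience is gathered into a buffer under the current policy, then the policy is updated offline from everything collected so far. The `policy.act` and `policy.loss` methods and the uniform sampler are placeholders, not the paper's training procedure.

```python
import random

buffer = []  # shared experience buffer of (obs, action, reward, next_obs, done) tuples

def collect(env, policy, task, n_episodes=100):
    """Collect phase: roll out the current policy and store every transition."""
    for _ in range(n_episodes):
        obs, done = env.reset(task), False
        while not done:
            action = policy.act(obs)                  # placeholder action-selection method
            next_obs, reward, done = env.step(action)
            buffer.append((obs, action, reward, next_obs, done))
            obs = next_obs

def sample(batch_size=64):
    """Uniformly sample a mini-batch from the buffer."""
    return random.sample(buffer, min(batch_size, len(buffer)))

def infer(policy, optimizer, n_updates=1000):
    """Infer phase: learn from stored experience, independent of how it was collected."""
    for _ in range(n_updates):
        batch = sample()
        loss = policy.loss(batch)   # placeholder, e.g. a Q-learning or BC objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```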
**Motivations**
The size and abilities of LLMs and VLMs have given rise to the term "foundation models", since they can be reused for many different downstream applications. This, together with the observation that these models exhibit common-sense reasoning, the ability to propose and sequence sub-goals, visual understanding, and other properties characteristic of agents that interact with and learn from environments, motivates using them to bootstrap the process of designing such agents.
**Experiments**
To test exploration and curriculum generation through language, the agent was compared against a baseline agent on tasks in the simulated environment. The challenges investigated were those listed above: efficient exploration, reuse of experience data, skill scheduling, and learning from observations.
**Limitations**
The results are intended as a proof of concept for applying LLMs and VLMs to RL, and therefore do not represent an exhaustive survey of LLM/VLM-RL integration. The agent was also compared only against a baseline agent rather than against any SOTA agent, which prevents a direct comparison of LLM/VLM-based RL agents with SOTA RL techniques, and there are no SOTA metrics to compare against.
**Significance**
This agent demonstrates applications of LLMs and VLMs to RL problems. While the results are not compared against SOTA for similar tasks, they are nonetheless striking and demonstrate intuitive advantages of a foundation-model approach. Reward sparseness and the lack of intrinsic rewards can be overcome by sub-goal decomposition, and these sub-goals, together with the LLM/VLM framework, can be used to bootstrap learning on new but similar tasks. The results run counter to the general trend in RL, where the number of steps needed to learn a task typically grows quickly as rewards become sparser.
**Future Work**
The researchers have listed a number of future directions particular to their implementation, such as generalizing the state and action spaces of the MDP, fine-tuning CLIP in a more general way, and testing the framework in real-world environments rather than only in simulation. For our purposes, we can also investigate comparisons between these approaches and SOTA RL approaches, and their efficacy in our intended environments.
Paper Link: Towards a Unified Agent with Foundation Models
Hey @bfaught3 thanks for creating an issue for this!
Could you please add a description for the issue in your first post and include the points described in the community guidelines here: https://docs.google.com/document/d/1LPCl8ivbPQsEx96sGBPeCY7AM8PWh7RUcy-TNNJkQ50/edit#heading=h.dwwjnh416i3
Thanks in advance!