ManifoldRG / Manifold-KB

This repository is a knowledge base of key insights, references to related research and implementations, and documentation of the various possible paths to a given goal, collected in one place.
GNU General Public License v3.0

AF Survey - "Towards A Unified Agent with Foundation Models" #10

Open bfaught3 opened 10 months ago

bhavul commented 10 months ago

Hey @bfaught3 thanks for creating an issue for this!

Could you please add a description for the issue in your first post and structure it according to the community guidelines described here: https://docs.google.com/document/d/1LPCl8ivbPQsEx96sGBPeCY7AM8PWh7RUcy-TNNJkQ50/edit#heading=h.dwwjnh416i3

Thanks in advance!

bfaught3 commented 5 months ago

Paper Review: Towards a Unified Agent with Foundation Models

Summary

Large language models (LLMs) and vision language models (VLMs) are used in conjunction with reinforcement learning (RL) agents via a framework that uses language as the core reasoning tool. The RL challenges addressed include efficient exploration, reusing experience data, scheduling skills, and learning from observations, all of which would otherwise require separate, vertically designed algorithms. The method is tested in a sparse-reward simulated robotic manipulation environment modeled with the MuJoCo physics simulator, wherein a robot arm interacts with an environment composed of a red, a blue, and a green object in a basket. The problem is formalized as a Markov Decision Process (MDP), where:

The framework is designed so that agents:
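Below is a minimal sketch of the language-centric loop described above, assuming a hypothetical LLM that decomposes the task instruction into sub-goals and a CLIP-style VLM that detects when a sub-goal has been achieved. The environment, policy, and model calls are stubs for illustration, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of a loop where language is the core
# reasoning tool: an LLM proposes sub-goals, a CLIP-style VLM scores how well
# the current observation matches a sub-goal's text, and the RL policy acts.

import random
from typing import List

def llm_propose_subgoals(task: str) -> List[str]:
    # Hypothetical LLM call: decompose the instruction into ordered sub-goals.
    return [f"{task}: step {i}" for i in range(1, 4)]

def vlm_similarity(observation, subgoal: str) -> float:
    # Hypothetical CLIP-style score between the image observation and sub-goal text.
    return random.random()

def env_step(action):
    # Stub environment: returns (observation, sparse_task_reward, done).
    return object(), 0.0, False

def policy(observation, subgoal: str):
    # Stub language-conditioned policy; a trained RL policy would go here.
    return "noop"

def run_episode(task: str, max_steps: int = 50, threshold: float = 0.9) -> float:
    subgoals = llm_propose_subgoals(task)
    obs, total_reward, idx = object(), 0.0, 0
    for _ in range(max_steps):
        action = policy(obs, subgoals[idx])
        obs, reward, done = env_step(action)
        total_reward += reward
        # The VLM decides whether the current sub-goal has been achieved,
        # providing a dense learning signal despite the sparse task reward.
        if vlm_similarity(obs, subgoals[idx]) > threshold:
            idx += 1
        if done or idx == len(subgoals):
            break
    return total_reward

if __name__ == "__main__":
    print(run_episode("stack the red object on the blue object"))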

Motivations

The size and abilities of LLMs and VLMs have led to the term Foundation Models, since these models can be adapted to many different downstream applications. This, together with the observation that they exhibit common-sense reasoning, the proposing and sequencing of sub-goals, visual understanding, and other properties characteristic of agents that interact with and learn from environments, motivates using them to bootstrap the process of designing such agents.

Experiments

To test exploration and curriculum generation through language, the agent was compared against a baseline agent on tasks in the virtual environment:

The challenges investigated were:
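One of the challenges listed in the Summary is reusing experience data. As a rough illustration (not the paper's implementation), the sketch below relabels stored transitions against a new sub-goal's text using a hypothetical similarity function, so experience collected for one task can seed learning of a related one; the captions, threshold, and matcher are all assumptions.

```python
# Illustrative sketch of relabelling stored experience against a new sub-goal.
# The Transition fields, the caption source, and text_match are hypothetical.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Transition:
    observation: object
    action: object
    achieved_caption: str  # e.g. logged metadata or an offline caption of the outcome

def text_match(caption: str, subgoal: str) -> float:
    # Hypothetical similarity; a real system might embed both with a VLM's
    # text/image encoders and compare. Here: crude substring match.
    return 1.0 if subgoal.lower() in caption.lower() else 0.0

def relabel(buffer: List[Transition], new_subgoal: str,
            threshold: float = 0.5) -> List[Tuple[Transition, float]]:
    # Keep transitions that already achieve the new sub-goal and assign them a
    # positive reward, turning old experience into supervision for the new task
    # without further environment interaction.
    return [(t, 1.0) for t in buffer
            if text_match(t.achieved_caption, new_subgoal) >= threshold]

if __name__ == "__main__":
    buffer = [
        Transition(None, None, "red object lifted above basket"),
        Transition(None, None, "blue object pushed to corner"),
    ]
    print(len(relabel(buffer, "red object lifted")))  # -> 1
```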

Limitations

The results are intended as a proof of concept for applying LLMs and VLMs to RL, and therefore do not represent an exhaustive survey of LLM/VLM-RL integration. The agent was also compared only against a baseline agent rather than a SOTA agent, which would have allowed a comparison of LLM/VLM-based RL agents against SOTA RL techniques. There are likewise no SOTA metrics to compare against. In particular:

Significance

This agent demonstrates applications of LLMs and VLMs to RL problems. While the results are not compared against SOTA for similar tasks, they are nonetheless striking and demonstrate intuitive advantages of a foundation-model approach. Reward sparseness and a lack of intrinsic rewards can be overcome by sub-goal decomposition, and those sub-goals, together with an LLM/VLM framework, can be used to bootstrap learning of new but similar tasks. This runs counter to the usual trend for RL agents, where the number of steps needed to learn a task typically grows rapidly as rewards become sparser.
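To make the bootstrapping point concrete, here is a minimal sketch, under the assumption that each previously learned sub-goal corresponds to a reusable skill policy, of how an LLM could schedule existing skills for a new but similar task rather than learning it from scratch. The skill library, names, and LLM call are hypothetical.

```python
# Minimal sketch of skill scheduling: an LLM (stubbed) maps a new instruction
# onto a sequence of already-learned, language-indexed skill policies.

from typing import Callable, Dict, List

SkillPolicy = Callable[[object], str]

def make_skill(name: str) -> SkillPolicy:
    # Stub skill: a trained sub-policy for one sub-goal would go here.
    return lambda obs: f"action-for-{name}"

SKILL_LIBRARY: Dict[str, SkillPolicy] = {
    "reach red object": make_skill("reach-red"),
    "grasp red object": make_skill("grasp-red"),
    "place on blue object": make_skill("place-on-blue"),
}

def llm_schedule(task: str, skills: List[str]) -> List[str]:
    # Hypothetical LLM call: choose an ordered subset of existing skills that
    # solves the new instruction. Here: keep skills whose words appear in it.
    return [s for s in skills if any(word in task for word in s.split())]

def execute(task: str, obs: object) -> List[str]:
    plan = llm_schedule(task, list(SKILL_LIBRARY))
    return [SKILL_LIBRARY[name](obs) for name in plan]

if __name__ == "__main__":
    print(execute("stack the red object on the blue object", obs=None))
```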

Future Work

The researchers have listed a number of future directions particular to their implementation, such as generalizing the state and action spaces of the MDP, fine-tuning CLIP in a more general way, and testing the framework in real-world environments rather than only in simulation. For our purposes, we can also investigate comparisons between these approaches and SOTA RL approaches, and their efficacy in our intended environments.

Paper Link: Towards a Unified Agent with Foundation Models