ManifoldRG / Manifold-KB

This repository serves as a knowledge base of key insights and details from other research and implementations, intended as a reference and a single place to document the various possible paths to achieve something.
GNU General Public License v3.0

AF Survey - "HuggigngGPT" #21

Open KrnTneja opened 10 months ago

KrnTneja commented 10 months ago

HuggingGPT: Paper, Code
Estimated time: 09/03/2023

Slides (Karan)

Review material tracking sheet
Community guidelines doc

KrnTneja commented 10 months ago

(In Progress)

Summary

The authors put forward the idea that LLMs like ChatGPT can be used as controllers for AI systems, drawing on resources like the HuggingFace model hub to solve complex tasks. HuggingGPT has four stages, three of which use ChatGPT for different purposes.

The first step is task planning. After receiving a user prompt, typically with multimedia content, ChatGPT is asked to create an ordered list of tasks drawn from a predefined set of task types (like img2text, text2voice) and to provide arguments for the inputs to these models. For example, a plan might say that the input image <img1.jpg> is to be fed to the first model (img2text) and then the output text <resource-1> is to be fed to the next model, and so on.
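For concreteness, here is a hypothetical plan for that example, loosely following the JSON-style plan format the paper describes; the field names (task, id, dep, args) and the <resource-k> convention are an approximation rather than the paper's exact schema.

```python
import json

# Hypothetical two-step plan: run an image-to-text model on the input image,
# then feed the resulting text to a text-to-speech model. Field names
# approximate the paper's format and are not the exact schema.
plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1],   # no dependencies
     "args": {"image": "img1.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0],   # depends on task 0
     "args": {"text": "<resource-0>"}},               # output of task 0
]
print(json.dumps(plan, indent=2))
```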

The second step is model selection. As the name suggests, they pick the exact model on the HuggingFace model hub that will be used for each task in the plan from the previous step. They filter the top candidates by their popularity on HuggingFace, retrieve their model descriptions, feed them to ChatGPT, and ask it to perform the selection.
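A rough sketch of that selection step might look like the following, assuming a placeholder ask_llm function for the ChatGPT call; the huggingface_hub query is only illustrative and its argument names may differ across library versions.

```python
from huggingface_hub import HfApi

def select_model(task: str, ask_llm, top_k: int = 5) -> str:
    """Pick a model for `task` by showing the LLM the most-downloaded candidates."""
    api = HfApi()
    # Rank Hub models for this task by downloads (a proxy for popularity).
    candidates = api.list_models(task=task, sort="downloads", direction=-1, limit=top_k)
    listing = "\n".join(f"- {m.id}" for m in candidates)  # a real system would also pass model-card descriptions
    prompt = (f"Task: {task}\nCandidate models:\n{listing}\n"
              "Reply with the single model id best suited for this task.")
    return ask_llm(prompt).strip()
```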

The third step is task execution, where the plan is executed by running models locally or via available inference endpoints.
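A minimal sketch of executing one planned task locally with a transformers pipeline is shown below; the fallback to hosted inference endpoints is omitted, and the resources dict holding <resource-k> placeholders is my own convention, not the paper's API.

```python
from transformers import pipeline

def execute_task(task: str, model_id: str, inputs, resources: dict, task_id: int):
    """Run one step of the plan and store its output for downstream tasks."""
    # Resolve "<resource-k>" placeholders produced by earlier tasks.
    if isinstance(inputs, str) and inputs.startswith("<resource-"):
        inputs = resources[inputs]
    runner = pipeline(task, model=model_id)  # local execution; endpoints omitted
    output = runner(inputs)
    resources[f"<resource-{task_id}>"] = output  # make the result available to later tasks
    return output
```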

Finally, the fourth step is response generation, where all the intermediate inputs and outputs (described as resources) and the selected model descriptions are fed to ChatGPT to generate a response that explains the output and how it was generated.
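This last step can be approximated by packing the plan, the chosen models, and the collected resources into a single prompt; generate_response and ask_llm below are assumed names for illustration, not the paper's API.

```python
def generate_response(user_prompt, plan, selections, resources, ask_llm):
    """Ask the LLM to explain the final output given every intermediate step."""
    lines = []
    for t in plan:
        rid = f"<resource-{t['id']}>"
        lines.append(f"Task {t['id']} ({t['task']}) used model {selections[t['id']]} "
                     f"and produced {resources.get(rid)}")
    prompt = (f"User request: {user_prompt}\n" + "\n".join(lines) +
              "\nDescribe the final answer and how it was obtained.")
    return ask_llm(prompt)
```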

The paper qualitatively discusses some interesting tasks that can be performed using HuggingGPT and quantitatively analyzes the individual steps and the output quality. Overall, the idea and implementation are very neat when the system is able to perform the task, but the intermediate steps are highly error-prone, owing primarily to the inherent difficulty of task planning.


Motivations

Experiments and Results

They divide the tasks into three types and evaluate the first step, i.e. task planning. The three types are based on what the task-planning graph looks like: single task, sequential task, and graph task. Precision, recall, and F1 values lie between 40-60% with ChatGPT, which clearly outperforms instruction-following LLMs such as Alpaca and Vicuna. GPT-4 outperforms GPT-3 on graph tasks.
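As a back-of-the-envelope illustration of these metrics, treating the predicted and gold plans as sets of task names gives the usual precision/recall/F1; the paper's actual matching criteria may be stricter (e.g. also checking arguments and dependencies).

```python
def plan_metrics(predicted: set, gold: set):
    """Precision, recall, and F1 between predicted and gold sets of task names."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(plan_metrics({"image-to-text", "text-to-speech"}, {"image-to-text"}))
# -> (0.5, 1.0, 0.666...)
```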

The test data for single and sequential tasks was annotated by GPT-4. For 46 complex tasks, human annotators were invited to provide expert annotations.

Limitations

Significance

This paper showed that LLMs can be used as high-level controllers to solve multi-modal tasks. The LLM controller acts as a common interface between many different models, providing new capabilities. This also makes it easy to accommodate new models and integrate them instantly with the HuggingGPT system.

Future Work

Improving the quality of task planning is a major challenge that needs to be solved. Relying solely on LLMs may not be a viable option for developing a production-quality system.

Related Work

LLMs: GPT-3, GPT-4, PaLM, LLaMA
Multi-modal LMs: Flamingo, BLIP-2, Kosmos-1
Tool/Model Integration with LLMs: Toolformer, Visual ChatGPT, Visual Programming, ViperGPT