ManifoldRG / Manifold-KB

This repository serves as a knowledge base of key insights and details from other research and implementations, intended as a reference and a single place to document the various possible paths to achieve something.
GNU General Public License v3.0

AF Survey - "HuggigngGPT" #21

Open KrnTneja opened 10 months ago

KrnTneja commented 10 months ago

HuggingGPT: Paper, Code
Estimated time: 09/03/2023

Slides (Karan)

Review material tracking sheet
Community guidelines doc

KrnTneja commented 10 months ago

(In Progress)

Summary

The authors put forward the idea that LLMs like ChatGPT can be used as controllers for AI systems, drawing on resources like the HuggingFace model hub to solve complex tasks. HuggingGPT has four stages, three of which use ChatGPT for different purposes.

The first step is task planning. After receiving a user prompt, typically with multimedia content, ChatGPT is asked to create an ordered list of tasks drawn from a predefined set of task types (like img2text, text2voice) and to provide arguments for the inputs to these models. For example, a plan might say that the input image <img1.jpg> is to be fed to the first model (img2text) and then the output text <resource-1> is to be fed to the next model, and so on.
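For concreteness, here is a hypothetical plan for that example, loosely following the JSON-style plan format the paper describes; the field names (task, id, dep, args) and the <resource-k> convention are an approximation rather than the paper's exact schema.

```python
import json

# Hypothetical two-step plan: run an image-to-text model on the input image,
# then feed the resulting text to a text-to-speech model. Field names
# approximate the paper's format and are not the exact schema.
plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1],   # no dependencies
     "args": {"image": "img1.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0],   # depends on task 0
     "args": {"text": "<resource-0>"}},               # output of task 0
]
print(json.dumps(plan, indent=2))
```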

The second step is model selection. As the name suggests, they pick the exact model on the HuggingFace model hub that will be used for each task in the plan from the previous step. They filter the top candidates by their popularity on HuggingFace, retrieve their model descriptions, feed them to ChatGPT, and ask it to perform the selection.
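A rough sketch of that selection step might look like the following, assuming a placeholder ask_llm function for the ChatGPT call; the huggingface_hub query is only illustrative and its argument names may differ across library versions.

```python
from huggingface_hub import HfApi

def select_model(task: str, ask_llm, top_k: int = 5) -> str:
    """Pick a model for `task` by showing the LLM the most-downloaded candidates."""
    api = HfApi()
    # Rank Hub models for this task by downloads (a proxy for popularity).
    candidates = api.list_models(task=task, sort="downloads", direction=-1, limit=top_k)
    listing = "\n".join(f"- {m.id}" for m in candidates)  # a real system would also pass model-card descriptions
    prompt = (f"Task: {task}\nCandidate models:\n{listing}\n"
              "Reply with the single model id best suited for this task.")
    return ask_llm(prompt).strip()
```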

The third step is task execution, where the plan is executed by running models locally or via available inference endpoints.
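A minimal sketch of executing one planned task locally with a transformers pipeline is shown below; the fallback to hosted inference endpoints is omitted, and the resources dict holding <resource-k> placeholders is my own convention, not the paper's API.

```python
from transformers import pipeline

def execute_task(task: str, model_id: str, inputs, resources: dict, task_id: int):
    """Run one step of the plan and store its output for downstream tasks."""
    # Resolve "<resource-k>" placeholders produced by earlier tasks.
    if isinstance(inputs, str) and inputs.startswith("<resource-"):
        inputs = resources[inputs]
    runner = pipeline(task, model=model_id)  # local execution; endpoints omitted
    output = runner(inputs)
    resources[f"<resource-{task_id}>"] = output  # make the result available to later tasks
    return output
```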

Finally, the fourth step is response generation, where all the intermediate inputs and outputs (described as resources) and the selected model descriptions are fed to ChatGPT to generate a response that explains the output and how it was generated.
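This last step can be approximated by packing the plan, the chosen models, and the collected resources into a single prompt; generate_response and ask_llm below are assumed names for illustration, not the paper's API.

```python
def generate_response(user_prompt, plan, selections, resources, ask_llm):
    """Ask the LLM to explain the final output given every intermediate step."""
    lines = []
    for t in plan:
        rid = f"<resource-{t['id']}>"
        lines.append(f"Task {t['id']} ({t['task']}) used model {selections[t['id']]} "
                     f"and produced {resources.get(rid)}")
    prompt = (f"User request: {user_prompt}\n" + "\n".join(lines) +
              "\nDescribe the final answer and how it was obtained.")
    return ask_llm(prompt)
```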

The paper qualitatively discusses some interesting tasks that can be performed using HuggingGPT and quantitatively analyzes the individual steps and the output quality. Overall, the idea and implementation are very neat when the system is able to perform the task, but the intermediate steps are highly error-prone, owing primarily to the inherent difficulty of task planning.


Motivations

Experiments and Results

They divide the tasks into three types and evaluate the first step, i.e. task planning. The three types are based on what the task-planning graph looks like: single task, sequential task, and graph task. Precision, recall, and F1 values lie between 40-60% with ChatGPT, which clearly outperforms instruction-following LLMs such as Alpaca and Vicuna. GPT-4 outperforms GPT-3 on graph tasks.
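As a back-of-the-envelope illustration of these metrics, treating the predicted and gold plans as sets of task names gives the usual precision/recall/F1; the paper's actual matching criteria may be stricter (e.g. also checking arguments and dependencies).

```python
def plan_metrics(predicted: set, gold: set):
    """Precision, recall, and F1 between predicted and gold sets of task names."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(plan_metrics({"image-to-text", "text-to-speech"}, {"image-to-text"}))
# -> (0.5, 1.0, 0.666...)
```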

The test data for single and sequential tasks was annotated by GPT-4. For 46 complex tasks, human annotators were invited to provide expert annotations.

Limitations

Significance

This paper showed that LLMs can be used as high-level controllers to solve multi-modal tasks. The LLM controller acts as a common interface between many different models, providing new capabilities. This also makes it easy to accommodate new models and integrate them instantly with the HuggingGPT system.

Future Work

Improving the quality of task planning is a major challenge that needs to be solved. Relying solely on LLMs may not be a viable option for developing a production-quality system.

Related Work

LLMs: GPT-3, GPT-4, PaLM, LLaMA
Multi-modal LMs: Flamingo, BLIP-2, Kosmos-1
Tool/Model Integration with LLMs: Toolformer, Visual ChatGPT, Visual Programming, ViperGPT