ggerganov / llama.cpp

LLM inference in C/C++
MIT License

adding support for tiva (text-image-video-audio) models-- NExTGPT-7B #4234

Closed itsPreto closed 6 months ago

itsPreto commented 10 months ago

First off, I want to give kudos to all who have contributed!! I've been [religiously] using llama.cpp for the past 6 months-ish and it's been amazing seeing all the growth this project has gone through.

I wanted to open up a discussion on potentially supporting these (TIVA) models:

This repository hosts the code, data, and model weights of NExT-GPT, the first end-to-end MM-LLM that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, and audio, and beyond.

code: https://github.com/NExT-GPT/NExT-GPT

weights: https://huggingface.co/ChocoWu/nextgpt_7b_tiva_v0

NExT-GPT is built on top of an existing pre-trained LLM, multimodal encoders, and SoTA diffusion models, with sufficient end-to-end instruction tuning. It works in three stages:

- Multimodal Encoding Stage. Established encoders encode inputs in various modalities, and these representations are projected through a projection layer into language-like representations comprehensible to the LLM.
- LLM Understanding and Reasoning Stage. An existing open-source LLM serves as the core, processing input information for semantic understanding and reasoning. The LLM not only directly generates text tokens but also produces unique “modality signal” tokens that instruct the decoding layers whether, and which, modal content to output.
- Multimodal Generation Stage. Receiving the multimodal signals with specific instructions from the LLM (if any), Transformer-based output projection layers map the signal token representations into ones understandable to the downstream multimodal decoders.

I'd recommend going through their paper for specifics.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.