First off, I want to give kudos to everyone who has contributed!! I've been [religiously] using llama.cpp for the past 6 months-ish, and it's been amazing seeing all the growth this project has gone through.
I wanted to open up a discussion on potentially supporting these TIVA (text, image, video, audio) models--
This repository hosts the code, data, and model weights of NExT-GPT, the first end-to-end MM-LLM that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, and audio, and beyond.
NExT-GPT is built on top of an existing pre-trained LLM, multimodal encoders, and SoTA diffusion models, with sufficient end-to-end instruction tuning.
Multimodal Encoding Stage. Established encoders are leveraged to encode inputs in various modalities; these representations are then projected, through a projection layer, into language-like representations comprehensible to the LLM.
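To make the encoding stage concrete, here is a minimal PyTorch sketch of the idea: a frozen modality encoder produces features, and a learned linear projection maps them into the LLM's embedding space. The dimensions and class names below are illustrative assumptions, not taken from the NExT-GPT code.

```python
import torch
import torch.nn as nn

# Hypothetical dims: actual values depend on the chosen encoder (e.g. ImageBind)
# and the backbone LLM (e.g. a 7B model with hidden size 4096).
ENCODER_DIM = 1024
LLM_HIDDEN_DIM = 4096

class InputProjection(nn.Module):
    """Maps frozen multimodal-encoder features into the LLM's embedding space."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        # encoder_features: (batch, num_patches_or_frames, encoder_dim)
        return self.proj(encoder_features)

# Usage: fake image features standing in for a frozen encoder's output.
image_features = torch.randn(1, 256, ENCODER_DIM)
projected = InputProjection(ENCODER_DIM, LLM_HIDDEN_DIM)(image_features)
print(projected.shape)  # (1, 256, 4096) -> ready to prepend to the text embeddings
```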
LLM Understanding and Reasoning Stage. An existing open-sourced LLM is harnessed as the core to process input information for semantic understanding and reasoning. The LLM not only directly generates text tokens but also produces unique “modality signal” tokens that serve as instructions telling the decoding layers whether, and which, modal content to output.
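As a rough illustration of how those signal tokens could be consumed downstream, here is a simplified sketch that scans generated text for hypothetical markers like `[IMG]` and decides which decoders to invoke; the actual NExT-GPT signal-token vocabulary and routing logic differ.

```python
import re

# Hypothetical signal-token markers; the real model uses its own special tokens,
# but the idea is the same: the LLM emits modality signals alongside plain text.
IMAGE_SIGNAL = "[IMG]"
AUDIO_SIGNAL = "[AUD]"
VIDEO_SIGNAL = "[VID]"

def route_output(llm_text: str) -> dict:
    """Split generated text into plain text plus per-modality generation requests."""
    requests = []
    if IMAGE_SIGNAL in llm_text:
        requests.append("image")
    if AUDIO_SIGNAL in llm_text:
        requests.append("audio")
    if VIDEO_SIGNAL in llm_text:
        requests.append("video")
    plain = re.sub(r"\[(IMG|AUD|VID)\]", "", llm_text).strip()
    return {"text": plain, "generate": requests}

print(route_output("Here is a sunset over the ocean [IMG]"))
# {'text': 'Here is a sunset over the ocean', 'generate': ['image']}
```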
Multimodal Generation Stage. Receiving the multimodal signals and specific instructions from the LLM (if any), the Transformer-based output projection layers map the signal-token representations into representations understandable to the downstream multimodal decoders.
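And a minimal sketch of what such a Transformer-based output projection might look like, assuming illustrative dimensions (e.g. a 4096-dim LLM hidden state mapped to a 768-dim, 77-token conditioning space like Stable Diffusion's); the real module's architecture and hyperparameters are defined in the NExT-GPT repo.

```python
import torch
import torch.nn as nn

# Hypothetical dims: LLM hidden size and the conditioning shape expected by a
# diffusion decoder (e.g. Stable Diffusion's 77 x 768 text conditioning).
LLM_HIDDEN_DIM = 4096
DIFFUSION_COND_DIM = 768
NUM_COND_TOKENS = 77

class OutputProjection(nn.Module):
    """Projects modality-signal hidden states into decoder conditioning vectors."""
    def __init__(self, llm_dim: int, cond_dim: int, num_cond_tokens: int):
        super().__init__()
        # Learned queries attend over the signal-token hidden states.
        self.queries = nn.Parameter(torch.randn(num_cond_tokens, llm_dim))
        layer = nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(llm_dim, cond_dim)

    def forward(self, signal_hidden: torch.Tensor) -> torch.Tensor:
        # signal_hidden: (batch, num_signal_tokens, llm_dim) -- hidden states of
        # the modality signal tokens emitted by the LLM.
        batch = signal_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        cond = self.decoder(tgt=q, memory=signal_hidden)
        return self.out(cond)  # (batch, NUM_COND_TOKENS, DIFFUSION_COND_DIM)

# Usage: pretend the LLM emitted 4 image-signal tokens.
signal_states = torch.randn(1, 4, LLM_HIDDEN_DIM)
conditioning = OutputProjection(LLM_HIDDEN_DIM, DIFFUSION_COND_DIM, NUM_COND_TOKENS)(signal_states)
print(conditioning.shape)  # (1, 77, 768) -> fed to the image diffusion decoder
```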
code: https://github.com/NExT-GPT/NExT-GPT
weights: https://huggingface.co/ChocoWu/nextgpt_7b_tiva_v0
I'd recommend going through their paper for specifics.