[feature] Ability to use AnyGPT for speech/text/image/music multimodality

kabachuha commented 1 month ago

AnyGPT is quite a promising project, released two months before GPT-4o.

It is a versatile multimodal LLaMA-based model that can take not only images as input, but also non-transcribed speech (for example, for voice cloning) and music. The output can likewise be speech, images and music, emitted in token form and fed into specialized models that generate the final outputs. The conditioning is implicit (e.g. unCLIP instead of text prompts for Stable Diffusion).
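To make the "tokens fed into specialized models" idea concrete, here is a rough sketch of how a mixed output stream could be split and routed. This is not AnyGPT's actual code: the boundary markers (`<soi>`, `<sosp>`, `<som>` and their closing counterparts), the function names and the decoder stubs are all assumptions for illustration, and a real setup would call an image, speech or music decoder where the stubs are.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical start/end markers per modality; the real vocabulary may differ.
MARKERS: Dict[str, Tuple[str, str]] = {
    "image": ("<soi>", "<eoi>"),
    "speech": ("<sosp>", "<eosp>"),
    "music": ("<som>", "<eom>"),
}

# One alternation with a named group per modality, capturing the payload.
_PATTERN = re.compile(
    "|".join(
        f"{re.escape(start)}(?P<{name}>.*?){re.escape(end)}"
        for name, (start, end) in MARKERS.items()
    ),
    re.DOTALL,
)


def split_modalities(stream: str) -> List[Tuple[str, str]]:
    """Split generated output into ordered (modality, payload) segments."""
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in _PATTERN.finditer(stream):
        # Anything between two marked spans is treated as plain text.
        text = stream[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        modality = match.lastgroup  # name of the group that matched
        segments.append((modality, match.group(modality).strip()))
        cursor = match.end()
    tail = stream[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


# Stub decoders (assumptions): each only reports how many tokens it received.
DECODERS: Dict[str, Callable[[str], str]] = {
    "text": lambda payload: payload,
    "image": lambda payload: f"[image from {len(payload.split())} tokens]",
    "speech": lambda payload: f"[speech from {len(payload.split())} tokens]",
    "music": lambda payload: f"[music from {len(payload.split())} tokens]",
}


def render(stream: str) -> List[str]:
    """Route every segment to the decoder registered for its modality."""
    return [DECODERS[modality](payload) for modality, payload in split_modalities(stream)]
```

Keeping the routing table separate from the parsing means a slow decoder backend can be swapped for a faster one without touching the rest.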

[Image: AnyGPT demo]

I think such a concept could improve the GPT-4o-like experience, although the encoder/decoder backends may need adjusting to make generation faster.

See the project page https://junzhan2000.github.io/AnyGPT.github.io/

https://github.com/OpenMOSS/AnyGPT

P.S. I think it would be a much better addition than just giving it vision via the legacy LLaVA

https://github.com/dnhkng/GlaDOS/blob/9d6bc9ac9e2cfc5d67edbd4d4fef620769984ff5/README.md?plain=1#L14

dnhkng commented 1 month ago

The model is really cool!

I'm not sure how those modalities would be useful though. For example, it could generate music, or just as easily, have function calling and pick a track on Spotify. Same for the image generation.
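For comparison, the function-calling route could look roughly like the sketch below. Everything here is hypothetical: `play_track` is a stand-in that contacts no real service, and the JSON shape of the tool call is an assumption, not something GlaDOS or any particular model is known to emit.

```python
import json
from typing import Callable, Dict


def play_track(query: str) -> str:
    """Hypothetical stand-in for a music-service call; no real API is contacted."""
    return f"playing: {query}"


# Registry of tools the model is allowed to call (assumed names).
TOOLS: Dict[str, Callable[..., str]] = {"play_track": play_track}


def dispatch(model_output: str) -> str:
    """Run a tool if the output is a JSON tool call, else return it as plain text."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # ordinary text reply
    if not isinstance(call, dict):
        return model_output
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return model_output  # unknown tool: fall back to the raw text
    return TOOLS[name](**call.get("arguments", {}))
```

With this, a reply like `{"name": "play_track", "arguments": {"query": "Still Alive"}}` picks an existing track instead of generating audio, while any other reply passes through unchanged.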

kabachuha commented 1 month ago

As I understand it, it's still quite a general-purpose model, and function calling should work with it as well (maybe with some tuning, if they overwrote the normal instruction-following too much).

The best thing is that the model acquires a better semantic understanding of things (how words, sounds, music and images connect with each other), so it may be worth exploring its "creative soul" :)

Decoders like Stable Diffusion can now run very fast on proper GPUs if you use things like LCMs (latent consistency models) and other tricks that allow few-step or one-step diffusion.

dnhkng/GlaDOS, issue #47