LazyAGI / LazyLLM

The easiest and laziest way to build multi-agent LLM applications.
https://docs.lazyllm.ai/
Apache License 2.0

Support for Multimodal Scenarios #33

Closed: wzh1994 closed this issue 3 months ago

wzh1994 commented 5 months ago

Model support must accommodate both inference and fine-tuning, with inference taking priority. Even if a model supports only inference, it should still be exposed through TrainableModule.
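As a rough illustration, here is a minimal sketch of wrapping an inference-only checkpoint in TrainableModule; the model name, the `start()` call, and the call convention are assumptions for illustration, not a final interface:

```python
# Hypothetical sketch: an inference-only model behind TrainableModule.
# The checkpoint name and start()/call pattern below are assumptions.
import lazyllm

sd3 = lazyllm.TrainableModule('stable-diffusion-3-medium')  # no fine-tuning needed
sd3.start()                      # deploy for inference only
image = sd3('a watercolor fox')  # invoked like any other module
```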

  1. Text-to-image model: Stable Diffusion 3 brings significant improvements in image quality, text rendering, and multi-subject generation. Candidate: Stable Diffusion 3 Medium (see the sketch after this list).

  2. Image-text understanding model: InternVL 1.5 supports image description and image Q&A. It trails Gemini Ultra slightly on TextVQA (it leans toward image description), but it is the strongest open-source multimodal option for general scenarios. Candidate: InternVL-Chat V1-5, with inference served by lmdeploy (see the sketch after this list).

  3. Text-to-speech and music generation models: ChatTTS is a new, highly praised open-source speech-synthesis project trained on 100,000+ hours of data. For music, the open-source Stable Audio Open 1.0 generates 90-second clips, while the non-open-source Stable Audio 2.0 can produce music, songs, melodies, or vocals up to 3 minutes long. Candidates: ChatTTS and Stable Audio Open 1.0 (see the ChatTTS sketch after this list).

  4. Speech-to-text model: in Chinese, OpenAI's Whisper is significantly outperformed by DAMO Academy's FunASR. The best results come from combining a recognition model with endpoint detection and punctuation prediction, so three models are needed: speech recognition, endpoint detection (VAD), and punctuation prediction (see the sketch after this list). Links: Speech Recognition, Endpoint Detection, Punctuation Prediction.
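For item 1, a standalone sketch of SD3 Medium inference using Hugging Face diffusers; the checkpoint id and parameters follow the public diffusers documentation, while how a LazyLLM deploy backend would drive it is left open:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load the SD3 Medium weights from the Hugging Face hub (gated; license acceptance required).
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_sample.png")
```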
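For item 2, a sketch of InternVL-Chat V1-5 inference through lmdeploy's vision-language pipeline, as the issue suggests (the image URL is a placeholder):

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Serve InternVL-Chat V1-5 with lmdeploy; the pipeline applies the VLM chat template.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')

image = load_image('https://example.com/cat.jpg')  # placeholder image URL
response = pipe(('describe this image', image))
print(response.text)
```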
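For item 3, a sketch of ChatTTS inference following its README; the `load`/`infer` method names and the 24 kHz output rate reflect the upstream project and may drift between releases:

```python
import ChatTTS
import soundfile as sf  # used only to write the waveform to disk

chat = ChatTTS.Chat()
chat.load()  # downloads and loads the pretrained models

texts = ["LazyLLM now supports text to speech."]
wavs = chat.infer(texts)  # returns one waveform per input text
sf.write("chattts_sample.wav", wavs[0].flatten(), 24000)  # ChatTTS emits 24 kHz audio
```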
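For item 4, a sketch that chains the three models exactly as described, using FunASR's AutoModel; the model ids are the commonly published FunASR defaults and may differ from the specific checkpoints linked above:

```python
from funasr import AutoModel

# Recognition + endpoint detection (VAD) + punctuation in one pipeline.
model = AutoModel(
    model="paraformer-zh",  # speech recognition
    vad_model="fsmn-vad",   # endpoint detection
    punc_model="ct-punc",   # punctuation prediction
)

result = model.generate(input="meeting.wav")  # path to an audio file
print(result[0]["text"])
```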

wzh1994 commented 3 months ago

https://github.com/LazyAGI/LazyLLM/pull/88
https://github.com/LazyAGI/LazyLLM/pull/100
https://github.com/LazyAGI/LazyLLM/pull/108
https://github.com/LazyAGI/LazyLLM/pull/112