Model support must accommodate both inference and fine-tuning, with a higher priority on inference. Even if only inference is available, it should still be added to the TrainableModule.
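The inference-first requirement can be sketched as a thin wrapper in which inference is always exposed and fine-tuning is optional. This is a hypothetical sketch, not the actual LazyLLM `TrainableModule` API; the class and function names below are assumptions for illustration.

```python
# Hypothetical sketch (NOT the real TrainableModule API): a module wrapper
# where inference is mandatory and fine-tuning is optional.
from typing import Callable, List, Optional


class TrainableModuleSketch:
    """Wraps a model with always-available inference and optional fine-tuning."""

    def __init__(self, infer_fn: Callable[[str], str],
                 finetune_fn: Optional[Callable[[List[str]], None]] = None):
        self._infer_fn = infer_fn
        self._finetune_fn = finetune_fn  # None => inference-only model

    def __call__(self, prompt: str) -> str:
        # Inference is the higher-priority capability and is always available.
        return self._infer_fn(prompt)

    @property
    def trainable(self) -> bool:
        return self._finetune_fn is not None

    def finetune(self, dataset: List[str]) -> None:
        # Inference-only models are still registered as modules, but
        # fine-tuning raises a clear error instead of failing silently.
        if self._finetune_fn is None:
            raise NotImplementedError("this model is inference-only")
        self._finetune_fn(dataset)


# Usage: an inference-only model (e.g. a text-to-image model) still fits
# the same module interface.
sd3 = TrainableModuleSketch(infer_fn=lambda p: f"<image for: {p}>")
print(sd3("a cat"))       # inference works
print(sd3.trainable)      # False: fine-tuning not supported
```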
Text-to-Image Model
Stable Diffusion 3 has seen significant improvements in image quality, text rendering, and multi-theme generation.
Stable Diffusion 3 Medium
Image-Text Understanding Model
InternVL 1.5 supports image description and image Q&A. While it performs slightly worse than Gemini Ultra on TextQA (it is more focused on image description), it is the best open-source multimodal solution for comprehensive scenarios.
InternVL-Chat V1-5 (inference using lmdeploy)
Text-to-Speech Model
A new and highly praised open-source speech-synthesis project, trained on 10 million hours of data.
ChatTTS
The open-source Stable Audio Open 1.0 generates audio clips up to 90 seconds long; the non-open-source Stable Audio 2.0 can produce music up to 3 minutes long, including songs, melodies, and vocals.
Stable Audio Open 1.0
Speech-to-Text Model
For Chinese, OpenAI's Whisper is significantly outperformed by DAMO Academy's FunASR. The best results are achieved by combining a recognition model with endpoint detection and punctuation prediction, so three models are needed, listed below: speech recognition, endpoint detection, and punctuation prediction.
Speech Recognition
Endpoint Detection
Punctuation Prediction