Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.

[Feat]: Caption/tags enhancement with multimodal LLMs #313

Open kabachuha opened 1 month ago

kabachuha commented 1 month ago

Describe your use-case.

This repository currently uses several relatively simple captioning models: BLIP, CLIP, and the WD taggers. When it comes to detailed descriptions, however, they are all dwarfed by modern multimodal LLMs such as LLaVA-style models, CogVLM, or InternLM-XComposer2. The latter has the most impressive capabilities as of now, since it accepts images at up to 4K resolution and can caption extremely fine details.

On top of that, unlike the models currently in the repo, these can take text input alongside the images, so it is possible to enhance preexisting captions or tags.

As shown by the PixArt series of models, especially PixArt-Sigma, well-captioned images substantially improve training quality. However, this mainly benefits models built on LLM text embeddings (T5 or other LLMs with context length > 300 tokens); models such as CLIP have very limited context length, resolution, embedding layer size, and pretraining data, so they cannot take full advantage of detailed captions (meaning not much benefit for SD1.5 or SDXL).
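
To make the caption/tag-enhancement idea concrete, here is a minimal sketch of such a pass using a LLaVA-style model through the `transformers` library. The checkpoint name (`llava-hf/llava-1.5-7b-hf`), the prompt wording, and the `enhance_caption` helper are illustrative assumptions, not part of OneTrainer; any multimodal LLM that accepts text plus an image could be substituted.

```python
# Sketch: enrich an existing tag list into a detailed caption with a multimodal LLM.
# Model id, prompt template, and generation settings are assumptions for illustration.
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def enhance_caption(image_path: str, existing_tags: str) -> str:
    """Feed the image together with its current tags and ask for a richer caption."""
    image = Image.open(image_path).convert("RGB")
    prompt = (
        "USER: <image>\n"
        f"These tags describe the image: {existing_tags}. "
        "Write one detailed caption that keeps this information and adds "
        "fine details visible in the image.\n"
        "ASSISTANT:"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Drop the echoed prompt and keep only the generated caption.
    text = processor.decode(output[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

print(enhance_caption("sample.jpg", "1girl, red dress, forest, sunlight"))
```

The same loop could run over a whole training folder, writing the enriched captions next to the images the way OneTrainer's existing captioning tools do.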

What would you like to see as a solution?

Have you considered alternatives? List them here.

No response

madrooky commented 1 month ago

Most of these features, if not all, are already available in TagGui, in case you or others don't know about that tool.

However, one model I would like to see in an expanded list in OT is moondream2: small (about 4 GB), powerful, and reliable for natural-language captions. In my experience it is more reliable than the 16 GB Llama 3 based competitors or LLaVA, which too often write plain nonsense and need supervision and careful prompting.
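
For reference, a minimal captioning sketch with moondream2 via `transformers`. This assumes the `vikhyatk/moondream2` checkpoint and the `encode_image` / `answer_question` helpers exposed by its remote code in earlier revisions; the API has changed between revisions, so check the model card and pin a revision before relying on it.

```python
# Sketch: natural-language captioning with moondream2 (API varies by model revision).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("sample.jpg")
encoded = model.encode_image(image)          # vision encoding step from the model card
caption = model.answer_question(encoded, "Describe this image in detail.", tokenizer)
print(caption)
```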

kabachuha commented 1 month ago

@madrooky thanks for your response, looks like a very nice tool ❤️