Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.

[Feat]: Caption/tags enhancement with multimodal LLMs #313

Open kabachuha opened 1 month ago

kabachuha commented 1 month ago

Describe your use-case.

This repository currently uses several relatively simple captioning models: BLIP, CLIP, and the WD taggers. When it comes to detailed descriptions, however, they are all dwarfed by modern multimodal LLMs such as LLaVA-style models, CogVLM, or InternLM-XComposer2. The latter has the most impressive capabilities as of now, since it accepts images at up to 4K resolution and can caption extremely fine details.

On top of that, unlike the models currently in the repo, these can take text input alongside the images, so it is possible to enhance preexisting captions or tags.

As shown by the PixArt series of models, especially PixArt-Sigma, well-captioned images substantially improve training quality. However, this mainly benefits models built on LLM text embeddings (T5 or other LLMs with context length > 300 tokens); models such as CLIP have very limited context length, resolution, embedding layer size, and pretraining data, so they cannot take full advantage of detailed captions (meaning not much benefit for SD1.5 or SDXL).
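
To make the caption/tag-enhancement idea concrete, here is a minimal sketch of such a pass using a LLaVA-style model through the `transformers` library. The checkpoint name (`llava-hf/llava-1.5-7b-hf`), the prompt wording, and the `enhance_caption` helper are illustrative assumptions, not part of OneTrainer; any multimodal LLM that accepts text plus an image could be substituted.

```python
# Sketch: enrich an existing tag list into a detailed caption with a multimodal LLM.
# Model id, prompt template, and generation settings are assumptions for illustration.
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def enhance_caption(image_path: str, existing_tags: str) -> str:
    """Feed the image together with its current tags and ask for a richer caption."""
    image = Image.open(image_path).convert("RGB")
    prompt = (
        "USER: <image>\n"
        f"These tags describe the image: {existing_tags}. "
        "Write one detailed caption that keeps this information and adds "
        "fine details visible in the image.\n"
        "ASSISTANT:"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Drop the echoed prompt and keep only the generated caption.
    text = processor.decode(output[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

print(enhance_caption("sample.jpg", "1girl, red dress, forest, sunlight"))
```

The same loop could run over a whole training folder, writing the enriched captions next to the images the way OneTrainer's existing captioning tools do.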

What would you like to see as a solution?

Have you considered alternatives? List them here.

No response

madrooky commented 1 month ago

Most of these features, if not all, are already available in TagGui, in case you or others don't know about that tool.

However, one model I would like to see in an expanded list in OT is moondream2: small (about 4 GB), powerful, and reliable for natural-language captions. In my experience it is more reliable than the 16 GB Llama 3 based competitors or LLaVA, which too often write plain nonsense and need supervision and careful prompting.
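
For reference, a minimal captioning sketch with moondream2 via `transformers`. This assumes the `vikhyatk/moondream2` checkpoint and the `encode_image` / `answer_question` helpers exposed by its remote code in earlier revisions; the API has changed between revisions, so check the model card and pin a revision before relying on it.

```python
# Sketch: natural-language captioning with moondream2 (API varies by model revision).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("sample.jpg")
encoded = model.encode_image(image)          # vision encoding step from the model card
caption = model.answer_question(encoded, "Describe this image in detail.", tokenizer)
print(caption)
```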

kabachuha commented 1 month ago

@madrooky thanks for your response, looks like a very nice tool ❤️