
Stable Diffusion #48

Open YunchaoYang opened 10 months ago

YunchaoYang commented 10 months ago

Introduction

Stable Diffusion is a latent diffusion model for generating images from text prompts.

Stable Diffusion architecture

Stable Diffusion is an open-source text-to-image model that creates images of different styles and content from a simple text prompt. In this context, a diffusion model is a generative model that produces high-quality images from textual descriptions by capturing the complex dependencies between the two modalities, text and images.

The following diagram shows a high-level architecture of a Stable Diffusion model:

It consists of the following key elements:

- A text encoder (CLIP), which converts the text prompt into token embeddings.
- An image information creator (a U-Net plus a scheduler), which runs the diffusion process in latent space.
- An image decoder (the VAE decoder), which paints the final image from the processed latents.
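These same components can be inspected directly on a Hugging Face diffusers pipeline. A minimal sketch, assuming diffusers and torch are installed and using one of the checkpoints listed further below:

```python
# Quick look at the three key components of a Stable Diffusion pipeline.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
print(type(pipe.text_encoder).__name__)  # CLIP text encoder -> token embeddings
print(type(pipe.unet).__name__)          # U-Net, the "image information creator"
print(type(pipe.vae).__name__)           # autoencoder whose decoder paints the final image
print(type(pipe.scheduler).__name__)     # scheduler that drives the step-by-step diffusion
```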

What is Diffusion Anyway?

Diffusion is the process that takes place inside the pink "image information creator" component. Given the token embeddings that represent the input text and a random starting image information array (also called latents), the process produces an information array that the image decoder uses to paint the final image.

This happens in a step-by-step fashion, with each step adding more relevant information. To get an intuition for the process, we can inspect the random latents array and see that it translates to visual noise; visual inspection here means passing it through the image decoder.

Diffusion happens over multiple steps. Each step operates on an input latents array and produces another latents array that better resembles the input text and the visual information the model picked up from the images it was trained on.


We can visualize a set of these latents to see what information gets added at each step.

https://github.com/YunchaoYang/Blogs/assets/6526592/662478d2-55e0-4655-8825-609230c67ac1

Something especially fascinating happens between steps 2 and 4 in this case. It's as if the outline emerges from the noise.
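To make this loop concrete, here is a minimal sketch of the step-by-step denoising using the pieces of a diffusers pipeline directly (assuming a recent diffusers version). The model ID, prompt, and step count are just examples, and details such as classifier-free guidance are omitted for clarity:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

# 1. Text encoder: prompt -> token embeddings (negative prompt / guidance omitted)
prompt_embeds, _ = pipe.encode_prompt(
    "a watercolor painting of a lighthouse",
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=False,
)

# 2. Start from a random latents array (pure visual noise when decoded)
pipe.scheduler.set_timesteps(20, device="cuda")
latents = torch.randn(1, 4, 64, 64, device="cuda") * pipe.scheduler.init_noise_sigma

with torch.no_grad():
    # 3. Step-by-step denoising: each step yields latents that better match the prompt
    for t in pipe.scheduler.timesteps:
        latent_input = pipe.scheduler.scale_model_input(latents, t)
        noise_pred = pipe.unet(latent_input, t, encoder_hidden_states=prompt_embeds).sample
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample
        # decoding `latents` here with pipe.vae.decode would show the image
        # gradually emerging from the noise, as in the video above

    # 4. Image decoder: latents -> image tensor in [-1, 1]
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```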

How CLIP is trained

CLIP is trained on a dataset of images and their captions. Think of a dataset looking like this, only with 400 million images and their captions:

CLIP is a combination of an image encoder and a text encoder. Its training process can be simplified as follows: take an image and its caption, encode the image with the image encoder and the caption with the text encoder, and train both encoders so that the two embeddings end up close together for matching pairs and far apart for mismatched pairs.

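As a toy illustration of this contrastive objective, here is a minimal PyTorch sketch of CLIP-style training. The stand-in linear encoders and feature dimensions are assumptions for illustration, not the real CLIP architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCLIP(nn.Module):
    def __init__(self, image_dim=512, text_dim=256, embed_dim=128):
        super().__init__()
        self.image_encoder = nn.Linear(image_dim, embed_dim)   # placeholder image encoder
        self.text_encoder = nn.Linear(text_dim, embed_dim)     # placeholder text encoder
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # learnable temperature (log scale)

    def forward(self, images, texts):
        img_emb = F.normalize(self.image_encoder(images), dim=-1)
        txt_emb = F.normalize(self.text_encoder(texts), dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()  # pairwise similarity matrix
        labels = torch.arange(len(images))             # matching pairs sit on the diagonal
        loss_i = F.cross_entropy(logits, labels)       # image -> text direction
        loss_t = F.cross_entropy(logits.t(), labels)   # text -> image direction
        return (loss_i + loss_t) / 2

model = ToyCLIP()
images = torch.randn(8, 512)   # pretend pre-extracted image features
texts = torch.randn(8, 256)    # pretend pre-extracted text features
loss = model(images, texts)
loss.backward()
```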

Types of Stable Diffusion models

| Model Name | Model Size (GB) |
| --- | --- |
| stabilityai/stable-diffusion-2-1-base | 2.5 |
| stabilityai/stable-diffusion-2-depth | 2.7 |
| stabilityai/stable-diffusion-2-inpainting | 2.5 |
| stabilityai/stable-diffusion-x4-upscaler | 7 |
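As a quick usage sketch, the base checkpoint above can be loaded with the Hugging Face diffusers library (this assumes diffusers, transformers, and torch are installed; the prompt and settings are just examples):

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=25,   # number of denoising steps
    guidance_scale=7.5,       # how strongly to follow the prompt
).images[0]
image.save("astronaut.png")
```

The depth, inpainting, and upscaler variants are loaded the same way, but through their dedicated pipeline classes.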

Application

1. Stable Diffusion 2

2. Stable Diffusion WebUI

Running model 1.5

Alternatives to Stable Diffusion

Solution overview

Serve models with an NVIDIA Triton Inference Server Python backend
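A hedged sketch of what the model.py for a Triton Python backend might look like in this setup; the input/output tensor names ("PROMPT", "IMAGE"), the checkpoint, and the loading details are illustrative assumptions, not a tested deployment:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the diffusion pipeline once per model instance.
        import torch
        from diffusers import StableDiffusionPipeline
        self.pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
        ).to("cuda")

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt_tensor = pb_utils.get_input_tensor_by_name(request, "PROMPT")
            prompt = prompt_tensor.as_numpy()[0].decode("utf-8")
            image = self.pipe(prompt, num_inference_steps=25).images[0]
            image_array = np.array(image).astype(np.uint8)  # HWC uint8 image
            out = pb_utils.Tensor("IMAGE", image_array)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```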

References

YunchaoYang commented 7 months ago

Tips for Stable diffusion image generation

Why do AI-generated photos often fail to capture the true likeness of a real subject? AI-generated images typically carry some of the model's own "imagination", which introduces randomness. The tips below help make the output more faithful to the subject.

Prompt

use ControlNet

Model

  1. Model 0: set the annotation resolution and use OpenPose to capture the arms and pose.

  2. Model 1: first use Canny to catch the sketch details, then adjust the weights when combining multiple models.

Also use composable LoRA together with its accompanying script.

Sampling method: Euler a, used to sample the noise that is overlaid and then decoded over the steps.

  1. To get a face with a certain expression, use inpainting with the "inpaint not masked" option.

References

https://www.youtube.com/watch?v=9v4lBexN_Mg

YunchaoYang commented 7 months ago

A curated list of awesome prompt examples:

https://github.com/Dalabad/stable-diffusion-prompt-templates/tree/main

YunchaoYang commented 7 months ago

Suggested keywords and parameters for AI-art QR codes (source: AI 小王子)

Base model: Revanimated

Positive prompt: ((best quality)), ((masterpiece)), (detailed), close-up person, long hair, (fantasy art:1.3), cute cyborg girl, highly detailed face, (render of April:1.1), beautiful artwork illustration, (portrait composition:1.3), (8k resolution:1.2)

Negative prompt: (worst quality:1.2), (low quality:1.2), (lowres:1.1), (monochrome:1.1), (greyscale), multiple views, comic, sketch, (((bad anatomy))), (((deformed))), (((disfigured))), watermark, multiple_views, mutation hands, mutation fingers, extra fingers, missing fingers, watermark

Base model: Realistic Vision 4.0 (everything below is the positive prompt; the negative prompt is the same as above)

- A beautiful big castle, sunny day, rocket, plane, galaxy, (cloud:1.4), sky, close up shot, by Hayao Miyazaki, trending on artstation, art, 4k, detailed, colorful, bright, film grain, studio lighting, HD, highres, salad, dinner best best quality, masterpiece, film grain, hamburger, dinner best best quality, masterpiece, film grain, studio lighting
- A beautiful landscape in sunshine, elf and wizards, by Hayao Miyazaki, trending on artstation, art, 4k, detailed, colorful, bright, film grain, studio lighting
- Necronomicon Sketch, page from Necronomicon, crossword puzzle, ((masterpiece), (best quality), (ultra-detailed), trending on artstation, film grain, studio lighting, HD, highres
- A beautiful mountain, trending on artstation, art, 4k, detailed, colorful, bright, film grain, studio lighting
- A beautiful man in fancy costume, trending on artstation, art, 4k, detailed, colorful, bright

Base model: XSarchi. Positive prompt: (masterpiece), (high quality), best quality, real, (realistic), super detailed, (full detail), (4k), architecture, Modern style, Blue sky and white clouds build, arbor, landscape, water, official art, extremely detailed CG unity 8k wallpaper

Negative prompt: (normal quality), (low quality), (worst quality), paintings, sketches, fog, signature, soft, blurry, drawing, sketch, poor quality, ugly, text, type, word, logo, pixelated, low resolution, saturated, high contrast, oversharpened, dirt

YunchaoYang commented 7 months ago

ControlNet

ControlNet has eight main application models: OpenPose, Canny, HED, Scribble, MLSD, Seg, Normal Map, and Depth. A brief introduction to each follows:

OpenPose (pose detection)

Pose detection allows precise control of body movement. Besides single-person poses it can also generate multi-person poses, and there is a hand-skeleton model that addresses imprecise hand drawing. In the example below, the left side is the reference image; OpenPose detects it and produces the skeleton pose in the middle, and then, using text-to-image with a description of the subject, scene details, and art style, we get an image with the same pose but a completely different style.


Canny (edge detection)

The Canny model uses edge detection to extract line art from the original image and then, guided by the prompt, generates an image with the same composition; it can also be used to color line art.


HED (edge detection)

Similar to Canny, but with more creative freedom. HED boundaries preserve the details of the input image; the rendered subjects show clear light-dark contrast and a stronger sense of outline, making it well suited for changing the style of an image while keeping the original composition.


Scribble (sketch extraction)

Turns doodles into images, with even more creative freedom than HED and Canny; it can also be used to color hand-drawn line art.


MLSD (straight-line detection)

Builds the outer frame of a building by analyzing the line structure and geometric shapes in the image, making it well suited for architectural design.


Seg (semantic segmentation)

Semantic segmentation of the original image distinguishes the color blocks of the scene, which is useful for restyling large scenes.


Normal Map

Suited for 3D-looking images: it extracts the normal vectors of the 3D objects in the input image and draws a new image using those normals as a reference, so the new image has exactly the same lighting and shading as the original.


Depth (depth detection)

Extracts the depth information from the original image to generate a new image with the same depth structure. You can also build a simple scene directly in 3D modeling software and then render the final image with the Depth model.


Another key feature of ControlNet is that multiple ControlNets can be combined to place several conditions on an image at once. For example, to control the background and the person's pose of an image separately, you can configure two ControlNets: the first uses the Depth model to extract and restyle the background structure, and the second uses the OpenPose model to control the person's pose. Moreover, by keeping the seed fixed (so the overall structure and style stay the same), defining different poses for the person, rendering them, and stitching the frames together, you can generate a short animation.

With these eight main ControlNet models, we have covered how to control image structure.
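A minimal sketch of the two-ControlNet setup described above, using the Hugging Face diffusers library. The ControlNet checkpoints are the commonly used lllyasviel ones, and the pre-computed condition images (a depth map and an OpenPose skeleton) as well as the prompt are assumptions for illustration:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

depth_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
pose_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[depth_cn, pose_cn],   # ControlNet 1: background structure, ControlNet 2: pose
).to("cuda")

depth_map = load_image("depth_map.png")       # condition image for the background
pose_skeleton = load_image("openpose.png")    # condition image for the person

image = pipe(
    "a dancer in a neon-lit alley, cinematic lighting",
    image=[depth_map, pose_skeleton],
    controlnet_conditioning_scale=[1.0, 1.0],            # per-ControlNet weights
    generator=torch.Generator("cuda").manual_seed(42),   # fixed seed keeps structure stable
    num_inference_steps=25,
).images[0]
image.save("controlled.png")
```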

Reference:

https://www.uisdc.com/stable-diffusion-2

YunchaoYang commented 7 months ago

Image style control

Stable Diffusion offers several main routes to image stylization: artist styles (named in the prompt), pretrained checkpoint (base) models, LoRA fine-tuned models, and Textual Inversion embedding models.
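A hedged sketch of how these routes map onto diffusers calls; the checkpoint name, file paths, and the &lt;concept&gt; token are placeholders:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("some/checkpoint-model")   # Checkpoint: choose the base model
pipe.load_lora_weights("style_lora.safetensors")                          # LoRA: attach a fine-tuned style
pipe.load_textual_inversion("concept_embedding.pt", token="<concept>")    # Textual Inversion: add a learned token
image = pipe("a castle, <concept>, by Hayao Miyazaki").images[0]          # Artist: name the style in the prompt
image.save("styled_castle.png")
```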

YunchaoYang commented 7 months ago

LoRA

In diffusion models such as Stable Diffusion, LoRA stands for Low-Rank Adaptation. It is a popular technique for fine-tuning these models efficiently, offering several advantages over other methods. Let's delve deeper into it:

What it does:

LoRA allows you to adapt a pre-trained Stable Diffusion model to a specific concept, style, or character with a significantly smaller file size than traditional fine-tuning produces. It achieves this by introducing only a small number of additional parameters (up to a few hundred megabytes) instead of modifying the entire model (gigabytes).

How it works:

Imagine the main model has weights W. LoRA adds a smaller set of weight changes ΔW. During generation, you control the influence of these changes with a blending factor α: setting α to 0 uses the original model, setting it to 1 applies the fully fine-tuned version, and values in between blend the two.
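A toy numerical sketch of that idea (illustrative only, not the actual Stable Diffusion code); the layer dimensions and rank are made-up examples:

```python
import torch

d_out, d_in, rank = 320, 768, 4           # made-up sizes for one attention projection
W = torch.randn(d_out, d_in)              # frozen pretrained weight (part of the gigabytes)
A = torch.randn(rank, d_in) * 0.01        # small trainable low-rank factor
B = torch.zeros(d_out, rank)              # starts at zero, so the update starts at zero

def effective_weight(alpha: float) -> torch.Tensor:
    """alpha = 0 -> original model, alpha = 1 -> fully adapted, in between -> blend."""
    delta_W = B @ A                        # low-rank update: only (d_out + d_in) * rank numbers
    return W + alpha * delta_W

x = torch.randn(d_in)
y_base = W @ x                             # base-model behaviour
y_blend = effective_weight(0.7) @ x        # behaviour with the LoRA partially blended in
print("full params:", W.numel(), "LoRA params:", A.numel() + B.numel())
```

With diffusers, a pretrained LoRA file is attached to a pipeline in much the same spirit via pipe.load_lora_weights(...), with a scale parameter playing the role of α.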

Why it works:

The key idea is that specific concepts or styles often manifest as subtle adjustments to the model's internal representations, and LoRA captures these nuanced changes efficiently. Compared to full fine-tuning, LoRA is:

- Faster: training takes fewer resources and less computation time.
- Smaller: the resulting files are much lighter and easier to share.
- Flexible: you can combine multiple LoRAs for various effects.

LoRA itself was introduced in the paper "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021); its application to Stable Diffusion builds upon several resources:

"Using LoRA for Efficient Stable Diffusion Fine-Tuning" by Hugging Face: This blog post provides a clear explanation and tutorial on using LoRA with Stable Diffusion. "cloneofsimo/lora" on GitHub: This repository delves into the technical details of LoRA implementation and offers tools for working with it. Stable Diffusion paper: Understanding the architecture and cross-attention layers of Stable Diffusion (as mentioned in the Hugging Face blog) helps grasp where LoRA intervenes for fine-tuning.