EdVince / Stable-Diffusion-NCNN

Stable Diffusion in NCNN with C++, supporting txt2img and img2img
BSD 3-Clause "New" or "Revised" License

Android port #3

Closed soham24 closed 1 year ago

EdVince commented 2 years ago

Stable Diffusion is a big model; it needs about 8 GB of RAM on x86. It may be a little difficult for mobile devices.

soham24 commented 2 years ago

@nihui can I use any NCNN optimisations to run this on mobile? Time isn't a constraint.

hubin858130 commented 1 year ago

I found that ONNX has a script for quantizing the Stable Diffusion model to int8. Could huihui provide an int8 quantization script for the ncnn model? I think there is hope of running it on mobile.

ClashSAN commented 1 year ago

@EdVince some models will run with 0.6 GB of RAM, e.g. this one: int8-quantized at 192x256 for miniSD (https://huggingface.co/lambdalabs/miniSD-diffusers), paired with this script:

import torch
from diffusers import (
    StableDiffusionOnnxPipeline,
    DPMSolverMultistepScheduler,  # other schedulers (DDIM, PNDM, Euler A, ...) can be swapped in here
)

# Load the DPM-Solver++ multistep scheduler shipped with the exported model
scheduler = DPMSolverMultistepScheduler.from_pretrained("./model", subfolder="scheduler")

# ONNX pipeline on CPU, using the long-prompt-weighting community pipeline
pipe = StableDiffusionOnnxPipeline.from_pretrained(
    "./model",
    custom_pipeline="lpw_stable_diffusion_onnx",
    revision="onnx",
    scheduler=scheduler,
    safety_checker=None,
    provider="CPUExecutionProvider",
)

prompt = "test prompt"
neg_prompt = ""

# Fixed CPU seed for reproducible output
generator = torch.Generator(device="cpu").manual_seed(1)

# 192x256 generation in 8 steps
image = pipe.text2img(
    prompt,
    negative_prompt=neg_prompt,
    num_inference_steps=8,
    width=192,
    height=256,
    guidance_scale=10,
    generator=generator,
    max_embeddings_multiples=3,
).images[0]
image.save("./test.png")

If you can also allow for smaller sizes, that would be great, as they will definitely run on a lower-RAM SoC. I can run the quantized ONNX model on a 3 GB RAM SoC.

See if you can run a mixture of a uint8 UNet and an fp16 text encoder; I've found that does improve picture quality (if you have a good prompt :0 )

additional info - https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/6585
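
For reference, a minimal sketch of how one exported ONNX component (e.g. the UNet) could be dynamically quantized to uint8 with onnxruntime's quantization tooling, leaving the text encoder in fp16/fp32 as suggested above; the file paths are placeholders, not paths from this repo:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic weight-only quantization of the UNet to uint8.
# Input/output paths are assumptions for illustration only.
quantize_dynamic(
    model_input="./model/unet/model.onnx",
    model_output="./model/unet/model_quant.onnx",
    weight_type=QuantType.QUInt8,  # uint8 weights; keep the text encoder at higher precision
)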

EdVince commented 1 year ago

As of the latest update, on Android we now need only 2.6 GB in total for 256*256 px; this is achieved with float16 mode. Currently I am focusing on CPU optimization; maybe after some days I will try to optimize the GPU mode, which will require less RAM. But even ignoring future optimization, if you turn on the GPU now it already lowers RAM usage somewhat on x86, and on Android it may lower RAM usage as well. As for int8, I will try it later, but it may be a little hard.
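
For anyone who wants to experiment with the same float16 and GPU settings through the ncnn Python bindings, a minimal sketch; the option names mirror ncnn's C++ Option struct, and the .param/.bin filenames are placeholders rather than the actual exports from this repo:

import ncnn

net = ncnn.Net()
# fp16 storage/arithmetic roughly halves weight memory compared to fp32
net.opt.use_fp16_packed = True
net.opt.use_fp16_storage = True
net.opt.use_fp16_arithmetic = True
# optional: run on the GPU via Vulkan, which can further reduce host RAM
net.opt.use_vulkan_compute = True
# placeholder model files for illustration
net.load_param("unet.param")
net.load_model("unet.bin")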

For the last point, the code is currently coupled to the naifu model, which I know is a bit out of date. Maybe we should switch to diffusers, but I have no plans to do this at the moment, because I'm committed to ncnn and I don't really work with diffusion models. And I don't think diffusion models are currently practical on mobile devices.

ClashSAN commented 1 year ago

@EdVince I've tried the apk; it runs on my 6 GB 865+, well done! If you are still working on improving this, what are you going for next? Maybe halve the steps needed by switching from Euler A to the sampler known in webui as DPM++ 2M Karras? Many thanks for your efforts.

EdVince commented 1 year ago

I spent some time on the new sampler "DPM++ 2M Karras" and have added the code to the x86 project; you can find it in the latest commit. But there are still some problems: with fewer steps, DPM++ 2M Karras gives bad results, so I have commented out that code. You can try it and try to fix it too.
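
For anyone who wants to debug the commented-out sampler: the part that distinguishes "DPM++ 2M Karras" from plain DPM++ 2M is the Karras et al. noise schedule. A small sketch of that schedule, with sigma_min/sigma_max set to approximate values commonly used for SD v1 models (treat them as assumptions):

import numpy as np

def karras_sigmas(n_steps, sigma_min=0.0292, sigma_max=14.6146, rho=7.0):
    # sigma_i = (sigma_max^(1/rho) + i/(n-1) * (sigma_min^(1/rho) - sigma_max^(1/rho)))^rho
    ramp = np.linspace(0.0, 1.0, n_steps)
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
    return np.append(sigmas, 0.0)  # a final sigma of 0 ends the sampling loop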