Open wjizhong opened 1 year ago
maybe provide further details for us.
Oneflow diffusers library: https://github.com/Oneflow-Inc/diffusers
From my tests (using the OneFlow/Huggingface diffusers library), the OneFlow framework gives an amazing ~250% speedup compared with the original (current) PyTorch implementation. That means an RTX 3060 running OneFlow == an RTX 3090 running PyTorch, an absolutely free GPU upgrade!
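For context on how such an it/s figure is typically measured, here is a minimal sketch using the standard Hugging Face diffusers pipeline (the model id, prompt, and step count are only illustrative; the numbers in this thread come from the tests reported below, not from this snippet):

```python
# Minimal it/s measurement sketch with the Hugging Face diffusers pipeline.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

steps = 50
start = time.time()
pipe("a woman riding a dragon", num_inference_steps=steps, guidance_scale=7)
elapsed = time.time() - start
# Includes VAE decode and other per-image overhead, so it slightly
# understates the pure denoising-loop it/s that samplers report.
print(f"{steps / elapsed:.1f} it/s")
```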
If we are comparing these, why is it fast? Are we comparing with the --xformers optimization enabled? Does the OneFlow framework use optimizations such as flash attention, which is found in xformers? If we find something easy to port to this CompVis architecture, perhaps someone can PR.
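For illustration only, here is a rough sketch of the kind of change such a port/PR would involve: replacing a plain softmax attention with xformers' memory-efficient (flash-attention-style) kernel. This is not the actual CompVis/webui attention code, just the general shape of the swap:

```python
# Illustrative only -- not the real CompVis/webui attention module.
import torch
import xformers.ops

def plain_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim); materializes the full seq x seq matrix
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale
    attn = attn.softmax(dim=-1)
    return attn @ v

def memory_efficient_attention(q, k, v):
    # Same math, but the fused kernel avoids storing the full attention
    # matrix, which is where most of the VRAM/speed win comes from.
    return xformers.ops.memory_efficient_attention(q, k, v)
```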
Testing under Manjaro Linux, resolution: 512x512, datatype: fp16, sampler: ddim, steps: 50, model: stable diffusion v1.4, guidance_scale: 7, cpu: R9 3900x, gpu: RTX3090 24GiB, prompt: a woman riding a dragon.
Huggingface diffusers library: Avg speed 12.1 it/s, Peak RAM 4.82 GiB, Peak VRAM 5.30 GiB
Stable diffusion webui no xformers: Avg speed 14.4 it/s, Peak RAM 5.11 GiB, Peak VRAM 4.69 GiB
Stable diffusion webui w/ xformers: Avg speed 17.2 it/s, Peak RAM 5.56 GiB, Peak VRAM 4.14 GiB
Oneflow diffusers library: Avg speed 25.7 it/s, Peak RAM 5.66 GiB, Peak VRAM 11.92 GiB (!)
Well, trading space for time...
Looks good
@happyme531 thanks, would you like to try the new version? The memory footprint of OneFlow has been greatly reduced.
If yes, please follow the steps here: https://github.com/Oneflow-Inc/diffusion-benchmark/blob/main/Dockerfile
To answer the xformers question: yes, they have used xformers to optimize. https://github.com/Oneflow-Inc/diffusers/issues/5#issuecomment-1336953449
I am testing text2img generation using OneFlowStableDiffusionPipeline, and I got an average speed of 27 it/s on a 3090 GPU. The scheduler I used is DPMSolverMultistepScheduler (which should be similar to DPM++), with a resolution of 512x512, batch size 1, and 15 steps.
With the same parameters, the speed obtained using sd-webui is approximately 7 it/s, with the xformers optimization enabled and the DPM++ 2S a Karras sampler.
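For reference, swapping in that scheduler on the standard Hugging Face diffusers pipeline looks roughly like the sketch below (the model id and step count are just placeholders matching the test above):

```python
# Hedged sketch: use DPMSolverMultistepScheduler with the HF diffusers pipeline.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Reuse the pipeline's scheduler config so timesteps/sigmas stay consistent.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a woman riding a dragon", num_inference_steps=15).images[0]
```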
Tested again with the following settings:
resolution: 512x512, datatype: fp16, sampler: PNDM/DDIM (DDIM is not working properly for OneFlow: low image quality with the same generation time. Other samplers seem to be available too; I will try again later), steps: 50, model: stable diffusion v1.4, guidance_scale: 7, cpu: R9 3900x, gpu: RTX3090 24GiB, prompt: a woman riding a dragon.
Stable diffusion webui no xformers: Avg speed 15.5 it/s, Peak RAM 6.6 GiB, Peak VRAM 4.8 GiB
Stable diffusion webui w/ xformers: Avg speed 18.4 it/s, Peak RAM 7.1 GiB, Peak VRAM 3.7 GiB
Oneflow diffusers library: Avg speed 38.7 it/s, Peak RAM 8.4 GiB, Peak VRAM 5.4 GiB
This is mind-blowing, just unbelievably fast. Still 2x faster even compared with the webui after applying the xformers optimization.
Please have a look at it. @AUTOMATIC1111
Tested again with the Euler A sampler and 32 steps for both applications. The results for performance and VRAM usage are roughly the same as in the last test. Just to show some images.
Prompt: "A digital illustration of a woman riding a dragon, she has red hair and a leather armor, the dragon is black and has red eyes, 8K masterpiece" (the prompt is not fine-tuned at all 🤔 )
Oneflow Diffusers:
Stable diffusion webui+xformers:
Test script for Oneflow Diffusers:
```python
import oneflow as torch  # OneFlow exposes a PyTorch-compatible API
from diffusers import OneFlowStableDiffusionPipeline
from diffusers.schedulers import OneFlowEulerAncestralDiscreteScheduler
from diffusers.schedulers import OneFlowPNDMScheduler
import time
import os

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

# Load the fp16 weights and switch to the Euler Ancestral scheduler.
pipe = OneFlowStableDiffusionPipeline.from_pretrained(
    model_id, use_auth_token=True, revision="fp16", torch_dtype=torch.float16)
pipe.scheduler = OneFlowEulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
print("pipe: ", pipe)
pipe = pipe.to(device)

width = 512
height = 512
iterations = 32
prompt = ""
count = 0
targetcount = 0

while True:
    try:
        if targetcount == 0:
            # Empty input reuses the last prompt, a number queues that many
            # repeated generations, anything else becomes the new prompt.
            inputStr = input("Prompt: ")
            if inputStr == "":
                pass
            elif inputStr.isdigit():
                targetcount = int(inputStr)
                print("targetcount: ", targetcount)
                continue
            else:
                prompt = inputStr
        targetcount = targetcount - 1
        if targetcount < 0:
            targetcount = 0

        t = time.time()
        with torch.autocast("cuda"):
            data = pipe(prompt, guidance_scale=7, width=width,
                        height=height, num_inference_steps=iterations)
            print("data: ", data)
            image = data["images"][0]
        t2 = time.time()
        print("Time taken: {:.2f}s, it/s: {:.2f}".format(t2 - t, iterations / (t2 - t)))

        count += 1
        image.save(f"output{count}.png")
        os.system(f"fish -c \" open output{count}.png \"")
    except KeyboardInterrupt:
        print("\nExiting")
        print("Last prompt: " + prompt)
        break
```
One thing just as a note: OneFlow does not currently support Windows, and it is not completely supported under WSL2 either. https://github.com/Oneflow-Inc/oneflow/issues/9398
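If someone does try it on an unsupported platform, a small guard like the following (purely a suggestion, not part of OneFlow) fails early with a clear message:

```python
# Optional early check before importing oneflow on an unsupported platform.
import platform

if platform.system() == "Windows":
    raise RuntimeError(
        "OneFlow has no native Windows build; use Linux, or check the WSL2 "
        "status in Oneflow-Inc/oneflow#9398 before trying."
    )
```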
Very supportive!
Looks good.
Is there an existing issue for this?
What would your feature do?
Can you merge the OneFlow framework?
Proposed workflow
Additional information
No response