Open wjizhong opened 1 year ago
maybe provide further details for us.
Oneflow diffusers library: https://github.com/Oneflow-Inc/diffusers
From my tests (using the OneFlow/Huggingface diffusers library), the OneFlow framework gives an amazing ~250% speedup compared with the original (current) PyTorch implementation. That means an RTX 3060 running OneFlow == an RTX 3090 running PyTorch, an absolutely free GPU upgrade!
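For context on how such an it/s figure is typically measured, here is a minimal sketch using the standard Hugging Face diffusers pipeline (the model id, prompt, and step count are only illustrative; the numbers in this thread come from the tests reported below, not from this snippet):

```python
# Minimal it/s measurement sketch with the Hugging Face diffusers pipeline.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

steps = 50
start = time.time()
pipe("a woman riding a dragon", num_inference_steps=steps, guidance_scale=7)
elapsed = time.time() - start
# Includes VAE decode and other per-image overhead, so it slightly
# understates the pure denoising-loop it/s that samplers report.
print(f"{steps / elapsed:.1f} it/s")
```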
If we are comparing these, why is it fast? Are we comparing with the --xformers optimization enabled? Does the OneFlow framework use optimizations such as flash attention, which is found in xformers? If we find something easy to port to this CompVis architecture, perhaps someone can PR.
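For illustration only, here is a rough sketch of the kind of change such a port/PR would involve: replacing a plain softmax attention with xformers' memory-efficient (flash-attention-style) kernel. This is not the actual CompVis/webui attention code, just the general shape of the swap:

```python
# Illustrative only -- not the real CompVis/webui attention module.
import torch
import xformers.ops

def plain_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim); materializes the full seq x seq matrix
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale
    attn = attn.softmax(dim=-1)
    return attn @ v

def memory_efficient_attention(q, k, v):
    # Same math, but the fused kernel avoids storing the full attention
    # matrix, which is where most of the VRAM/speed win comes from.
    return xformers.ops.memory_efficient_attention(q, k, v)
```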
Testing under Manjaro Linux, resolution: 512x512, datatype: fp16, sampler: ddim, steps: 50, model: stable diffusion v1.4, guidance_scale: 7, cpu: R9 3900x, gpu: RTX3090 24GiB, prompt: a woman riding a dragon.
Huggingface diffusers library: Avg speed 12.1 it/s, Peak RAM 4.82 GiB, Peak VRAM 5.30 GiB
Stable diffusion webui no xformers: Avg speed 14.4 it/s, Peak RAM 5.11 GiB, Peak VRAM 4.69 GiB
Stable diffusion webui w/ xformers: Avg speed 17.2 it/s, Peak RAM 5.56 GiB, Peak VRAM 4.14 GiB
Oneflow diffusers library: Avg speed 25.7 it/s, Peak RAM 5.66 GiB, Peak VRAM 11.92 GiB (!)
Well, trading space for time...
Looks good
@happyme531 thanks, would you like to try the new version? The memory footprint of OneFlow has been greatly reduced.
If yes, please follow the steps here: https://github.com/Oneflow-Inc/diffusion-benchmark/blob/main/Dockerfile
To answer the xformers question: yes, they have used xformers to optimize. https://github.com/Oneflow-Inc/diffusers/issues/5#issuecomment-1336953449
I am testing text2img generation using OneFlowStableDiffusionPipeline, and I got an average speed of 27 it/s on a 3090 GPU. The scheduler I used is DPMSolverMultistepScheduler (which should be similar to DPM++), with a resolution of 512x512, batch size 1, and 15 steps.
With the same parameters, the speed obtained using sd-webui is approximately 7 it/s, with the xformers optimization enabled and the DPM++ 2S a Karras sampler.
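For reference, swapping in that scheduler on the standard Hugging Face diffusers pipeline looks roughly like the sketch below (the model id and step count are just placeholders matching the test above):

```python
# Hedged sketch: use DPMSolverMultistepScheduler with the HF diffusers pipeline.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Reuse the pipeline's scheduler config so timesteps/sigmas stay consistent.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a woman riding a dragon", num_inference_steps=15).images[0]
```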
Tested again with the following settings:
resolution: 512x512, datatype: fp16, sampler: PNDM/DDIM (DDIM is not working properly for OneFlow: low image quality with the same generation time. Other samplers seem to be available too; I will try again later), steps: 50, model: stable diffusion v1.4, guidance_scale: 7, cpu: R9 3900x, gpu: RTX3090 24GiB, prompt: a woman riding a dragon.
Stable diffusion webui no xformers: Avg speed 15.5 it/s, Peak RAM 6.6 GiB, Peak VRAM 4.8 GiB
Stable diffusion webui w/ xformers: Avg speed 18.4 it/s, Peak RAM 7.1 GiB, Peak VRAM 3.7 GiB
Oneflow diffusers library: Avg speed 38.7 it/s, Peak RAM 8.4 GiB, Peak VRAM 5.4 GiB
This is mind-blowing, just unbelievably fast. Still 2x faster even compared with the webui after applying the xformers optimization.
Please have a look at it. @AUTOMATIC1111
Tested again with the Euler A sampler and 32 steps for both applications. The results for performance and VRAM usage are roughly the same as in the last test. Just to show some images.
Prompt: "A digital illustration of a woman riding a dragon, she has red hair and a leather armor, the dragon is black and has red eyes, 8K masterpiece" (the prompt is not fine-tuned at all 🤔 )
Oneflow Diffusers:
Stable diffusion webui+xformers:
Test script for Oneflow Diffusers:
```python
import oneflow as torch  # OneFlow exposes a PyTorch-compatible API
from diffusers import OneFlowStableDiffusionPipeline
from diffusers.schedulers import OneFlowEulerAncestralDiscreteScheduler
from diffusers.schedulers import OneFlowPNDMScheduler
import time
import os

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

# Load the fp16 weights and switch to the Euler Ancestral scheduler.
pipe = OneFlowStableDiffusionPipeline.from_pretrained(
    model_id, use_auth_token=True, revision="fp16", torch_dtype=torch.float16)
pipe.scheduler = OneFlowEulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
print("pipe: ", pipe)
pipe = pipe.to(device)

width = 512
height = 512
iterations = 32
prompt = ""
count = 0
targetcount = 0

while True:
    try:
        if targetcount == 0:
            # Empty input reuses the last prompt, a number queues that many
            # repeated generations, anything else becomes the new prompt.
            inputStr = input("Prompt: ")
            if inputStr == "":
                pass
            elif inputStr.isdigit():
                targetcount = int(inputStr)
                print("targetcount: ", targetcount)
                continue
            else:
                prompt = inputStr
        targetcount = targetcount - 1
        if targetcount < 0:
            targetcount = 0

        t = time.time()
        with torch.autocast("cuda"):
            data = pipe(prompt, guidance_scale=7, width=width,
                        height=height, num_inference_steps=iterations)
            print("data: ", data)
            image = data["images"][0]
        t2 = time.time()
        print("Time taken: {:.2f}s, it/s: {:.2f}".format(t2 - t, iterations / (t2 - t)))

        count += 1
        image.save(f"output{count}.png")
        os.system(f"fish -c \" open output{count}.png \"")
    except KeyboardInterrupt:
        print("\nExiting")
        print("Last prompt: " + prompt)
        break
```
One thing just as a note: OneFlow does not currently support Windows, and it is not completely supported under WSL2 either. https://github.com/Oneflow-Inc/oneflow/issues/9398
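If someone does try it on an unsupported platform, a small guard like the following (purely a suggestion, not part of OneFlow) fails early with a clear message:

```python
# Optional early check before importing oneflow on an unsupported platform.
import platform

if platform.system() == "Windows":
    raise RuntimeError(
        "OneFlow has no native Windows build; use Linux, or check the WSL2 "
        "status in Oneflow-Inc/oneflow#9398 before trying."
    )
```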
Very supportive!
Looks good.
Is there an existing issue for this?
What would your feature do?
Can you merge the OneFlow framework?
Proposed workflow
Additional information
No response