AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Feature Request]: Support for Apple's Core ML Stable Diffusion #5309

Open jkcarney opened 1 year ago

jkcarney commented 1 year ago

What would your feature do?

https://github.com/apple/ml-stable-diffusion

Apple very recently added support for converting Stable Diffusion models to the Core ML format to allow for faster generation times.

  1. It would be nice to support this conversion pipeline within the web UI, perhaps as an option in the extras tab or checkpoint merger (it's not really a merge per se, but it could apply?)
  2. Allow the web UI to run the Core ML models instead of the regular SD PyTorch models.

Proposed workflow

  1. Go to the extras tab or checkpoint merger tab
  2. Select a script or similar to convert a .ckpt file in your models directory to the Core ML format (a rough sketch of the underlying command is shown below)
  3. Allow use of the Core ML model within the web UI for Apple users.
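For reference, Apple's repo already exposes this conversion as a CLI entry point that such a script could wrap. A rough sketch (note that it converts from a Hugging Face/diffusers model, so a local .ckpt would first need to be converted to the diffusers layout, and the exact flags may vary by version):

python -m python_coreml_stable_diffusion.torch2coreml --convert-unet --convert-text-encoder --convert-vae-decoder --convert-safety-checker --model-version CompVis/stable-diffusion-v1-4 -o ./sdmodel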

Additional information

No response

NightMachinery commented 1 year ago

https://github.com/apple/ml-stable-diffusion/issues/9

ioma8 commented 1 year ago

This please!

autumnmotor commented 1 year ago

https://github.com/apple/ml-stable-diffusion

I've tried running and inspecting the sample from the repository above (still investigating). It looks like the Core ML format does not reduce image generation time; if anything, the PyTorch implementation is slightly faster (at least on my Mac Studio, M1 Ultra, 48-core GPU). It's fine to want Automatic1111 to support the Core ML format, but before that, everyone with an M1 Mac should do some benchmarking and carefully consider whether it's really an urgent thing to request.

Translated from Japanese to English by Google.

ioma8 commented 1 year ago

Yes, I have also finally tried it on a Mac M1 and it is indeed slower than the current implementation.

sascha1337 commented 1 year ago

Did you try the Python one or the Swift one?

NightMachinery commented 1 year ago

> Yes, I have also finally tried it on a Mac M1 and it is indeed slower than the current implementation.

Please publish all the relevant details, e.g., macOS version (latest 13.1 beta is needed), which Mac, which compute units, Swift or Python, and whether you have included the model loading time.

Their own benchmarks say that an M2 generates an image in 23 seconds, which is certainly much faster than PyTorch. I myself don't have macOS 13.1 and Xcode installed to test.

ronytomen commented 1 year ago

I can test on a MacBook Air M2 with 24GB of RAM, but a little guidance on how (no crazy detail needed) would be nice.

autumnmotor commented 1 year ago

Mac Studio (M1 Ultra, 128GB RAM, 48-core GPU), macOS Ventura (13.1 beta), Xcode 14.1. Image size: 512x512, steps: 20 (21 in coreml_python), model: CompVis/stable-diffusion-v1-4. Model load time is not included.

WebUI (MPS): 2.61 it/s
Core ML (Python, cpu+gpu+ane): 2.64 it/s
Core ML (Swift, cpu+gpu+ane): 2.23 it/s

There doesn't seem to be a dramatic difference in speed.

Automatic1111 SD WebUI (sampling method: Euler a), using MPS

Total progress: 100%|███████████████████████████| 20/20 [00:07<00:00, 2.61it/s]

Using CPU only (--use-cpu all)

Total progress: 100%|███████████████████████████| 20/20 [01:10<00:00, 3.51s/it]

= 0.285 it/s

Core ML (Python, scheduler: default, probably DDIM)

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit ALL --seed 93 --num-inference-steps 20

100%|███████████████████████████████████████████| 21/21 [00:07<00:00, 2.64it/s]

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit CPU_AND_GPU --seed 93 --num-inference-steps 20

100%|███████████████████████████████████████████| 21/21 [00:12<00:00, 1.73it/s]

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit CPU_AND_NE --seed 93 --num-inference-steps 20

100%|███████████████████████████████████████████| 21/21 [00:17<00:00, 1.22it/s]

Core ML (Swift; scheduler/sampling method unknown, as I don't know much about Swift)

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units all

Step 20 of 20 [mean: 2.23, median: 2.50, last 2.46] step/sec

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units cpuAndNeuralEngine

Step 20 of 20 [mean: 1.16, median: 1.17, last 1.16] step/sec

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units cpuAndGPU

Step 20 of 20 [mean: 1.86, median: 2.96, last 2.95] step/sec

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units cpuOnly

Step 20 of 20 [mean: 0.12, median: 0.12, last 0.12] step/sec

NightMachinery commented 1 year ago

@autumnmotor Aren't your results much better than Torch MPS? 1.22it/s vs 2.61it/s.

> use MPS
>
> Total progress: 100%|███████████████████████████| 20/20 [00:07<00:00, 2.61it/s]
>
> python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit CPU_AND_NE --seed 93 --num-inference-steps 20
>
> 100%|███████████████████████████████████████████| 21/21 [00:17<00:00, 1.22it/s]

sascha1337 commented 1 year ago

INFO:python_coreml_stable_diffusion.coreml_model:Loading a CoreML model through coremltools triggers compilation every time. The Swift package we provide uses precompiled Core ML models (.mlmodelc) to avoid compile-on-load.

... and rumors tell us we should use Ventura 13.1 beta 4.

What exact macOS build are you using?
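For what it's worth, the compile-on-load overhead mentioned in that INFO message can be avoided by pre-compiling the .mlpackage files to .mlmodelc ahead of time, which is what the Swift package relies on. A minimal sketch using Xcode's command-line tools (the paths are placeholders, and behavior may vary by Xcode version):

xcrun coremlcompiler compile ./sdmodel/Unet.mlpackage ./sdmodel/compiled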

autumnmotor commented 1 year ago

@NightMachinery Hmmm... the higher the "it/s" (iterations per second) value, the better the performance, so 2.61 it/s is faster than 1.22 it/s. Also check the total processing time on the left (model load time not included): WebUI (MPS): 7 s, Core ML (Python): 17 s.
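To spell out the arithmetic: 20 steps ÷ 2.61 it/s ≈ 7.7 s for the WebUI MPS run, versus 21 steps ÷ 1.22 it/s ≈ 17.2 s for the CPU_AND_NE run, which matches the ~7 s and ~17 s totals in the logs above.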

@sascha1337

> ... and rumors tell us we should use Ventura 13.1 beta 4.

My macOS is 13.1 Beta (22C5033e), which means beta "1". Thanks for the very useful information. Luckily I'm in the Apple Developer Program, so I'll try beta 4 later.

autumnmotor commented 1 year ago

macOS 13.1 beta 1 -> beta 4

WebUI (MPS): 2.61 it/s -> 2.66 it/s
Core ML (Python, cpu+gpu+ane): 2.64 it/s -> 2.65 it/s
Core ML (Swift, cpu+gpu+ane): 2.23 it/s -> 2.23 it/s

I still need to look into it more carefully, but my conclusion for now is that the differences are within the margin of error.

sascha1337 commented 1 year ago

@autumnmotor Sir, what sampler did you use, DDIM?

atiorh commented 1 year ago

@autumnmotor Could you also report Core ML (Python, cpu+gpu) please?

juan9999 commented 1 year ago

There is a basic implementation now

https://github.com/godly-devotion/mochi-diffusion

Would be great if you guys somehow teamed up!

rjp23 commented 1 year ago

How would we go about testing the CoreML versions already converted? I assume I can't just drop all of the files into the models directory?
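For testing outside the web UI, the already-converted models should at least run with Apple's own pipeline, reusing the same invocation as in the benchmarks above, with -i pointing at the folder of converted .mlpackage files (the path below is a placeholder):

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./path/to/converted-model -o ./out --compute-unit ALL --num-inference-steps 20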

RnbWd commented 1 year ago

> use MPS

How do we use MPS with the web UI? I thought it was CPU-only.
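A quick way to check whether the PyTorch build in the web UI's environment can see MPS at all (assuming a reasonably recent PyTorch):

python -c "import torch; print(torch.backends.mps.is_available())"

If that prints True, the web UI on Apple Silicon should use MPS rather than falling back to the CPU.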

namnhfreelancer commented 1 year ago

Please do it

GrinZero commented 1 year ago

I'm a Mac user and I tried the Draw Things app, which supports Core ML. On a Mac mini M1, with the same Anything v3 model at 30 steps, it takes about 2 min 40 s to generate an image with the webui and 45 s with Draw Things, so I think supporting Core ML is still worthwhile. On another Mac, an M1 Pro, it's 20 s for Draw Things and 35 s for the webui. (Aside from the complaint, Draw Things could also serve as an interface design reference.)
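That works out to roughly a 3.5x speedup on the Mac mini M1 (about 160 s vs 45 s), but only about 1.75x on the M1 Pro (35 s vs 20 s).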

9Somboon commented 1 year ago

please.

genevera commented 1 year ago

bump

I just compiled the HF Diffusers app on my M2 Max and can whip out a 45-step SD 2.1 image in about 18.5 s, vs 43 s with A1111 and the pruned model.

RnbWd commented 1 year ago

There are apps that use Apple's Core ML Stable Diffusion. The best one I could find is here: https://github.com/godly-devotion/MochiDiffusion

However, if you've ever tried using Apple's Core ML implementation, you might have noticed that it takes a LONG time to initialize the model every time it first runs. Using the CLI from Apple's examples, it takes about a minute on my M1 before the model even starts running. I think Mochi caches the compiled Core ML model, which makes it more useful. On a MacBook Air M1, I'm only seeing a 20% increase in diffusion speed at most, and the startup time for loading any model makes it not worth it. An M1 Pro/Max or M2 Pro/Max might see much more significant gains than the M1 base model.

sascha1337 commented 1 year ago

Dude, this depends on the RAM. With 8 GB you get bad times; 128 GB is the way to go, so you can keep the UNet chunks in cache.

foolyoghurt commented 1 year ago

Please do it!

genevera commented 11 months ago

Draw Things has an HTTP API. Maybe requests for the things it can handle could be forwarded to it, while keeping A1111 as the front end?
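Something like the sketch below, for example; note that the endpoint path and port here are assumptions on my part (an A1111-style txt2img API), not documented Draw Things values, so treat it purely as an illustration:

curl -s http://127.0.0.1:7860/sdapi/v1/txt2img -H "Content-Type: application/json" -d '{"prompt": "a photo of an astronaut riding a horse on mars", "steps": 20}'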

genevera commented 11 months ago

> There are apps that use Apple's Core ML Stable Diffusion. The best one I could find is here: https://github.com/godly-devotion/MochiDiffusion
>
> However, if you've ever tried using Apple's Core ML implementation, you might have noticed that it takes a LONG time to initialize the model every time it first runs. Using the CLI from Apple's examples, it takes about a minute on my M1 before the model even starts running. I think Mochi caches the compiled Core ML model, which makes it more useful. On a MacBook Air M1, I'm only seeing a 20% increase in diffusion speed at most, and the startup time for loading any model makes it not worth it. An M1 Pro/Max or M2 Pro/Max might see much more significant gains than the M1 base model.

Check out Draw Things... it's not open source but it is free and it beats everything else in performance, I think.

marshalleq commented 9 months ago

> Dude, this depends on the RAM. With 8 GB you get bad times; 128 GB is the way to go, so you can keep the UNet chunks in cache.

128 GB of Mac memory, lol. Apple's golden money earner. Apple Silicon currently only goes up to 96 GB, I believe.