comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Use ComfyUI (SDXL) on MacOS (MacBook Pro M1) #1905

Open 0x1337ff opened 11 months ago

0x1337ff commented 11 months ago

Hi

I'm trying ComfyUI on my MacBook Pro M1.

With the default workflow and an SD1.5 model (revAnimated_v11.safetensors), generating an image takes only about 40-50 seconds, which is quick and nice.

But with SDXL it takes 45 minutes to an hour, and I don't understand why :/

I tried two models, but it's the same...

I'm sharing some screenshots in case they help:

(Screenshot 2023-11-04 at 21:07:57)

(Screenshot 2023-11-04 at 21:08:06)

Here are the two workflow templates I tested:

SDXL + Refiner (default).json
Workflow SDXL BASE-REFINER-LORA.json

In case it helps, this is what I see in the console:


╭─    ~/ComfyUI    master !1 ?2 ──────────────────────────────────────────── ✔  21:14:44  ─╮
╰─ python3 main.py                                                                                  ─╯
Total VRAM 16384 MB, total RAM 16384 MB
Set vram state to: SHARED
Device: mps
VAE dtype: torch.float32
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --use-split-cross-attention
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model_type EPS
adm 2816
Using split attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using split attention in VAE
missing {'cond_stage_model.clip_l.text_projection', 'cond_stage_model.clip_l.logit_scale'}
left over keys: dict_keys(['cond_stage_model.clip_l.transformer.text_model.embeddings.position_ids'])
Requested to load SDXLClipModel
Loading 1 new model
Requested to load SDXL
Loading 1 new model
  0%|                                                                           | 0/20 [00:00<?, ?it/s]

/opt/homebrew/lib/python3.11/site-packages/torchsde/_brownian/brownian_interval.py:608: UserWarning: Should have tb<=t1 but got tb=14.614643096923828 and t1=14.614643.
  warnings.warn(f"Should have {tb_name}<=t1 but got {tb_name}={tb} and t1={self._end}.")

5%|███▏                                                            | 1/20 [03:52<1:13:45, 232.93s/it]

That per-iteration time is crazy... why?

I have installed PyTorch following the macOS doc: https://developer.apple.com/metal/pytorch/


╭─    ~ ─────────────────────────────────────────────────────────────────────────────────────── ✔  21:39:22  ─╮
╰─ python3                                                                                                          ─╯
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> if torch.backends.mps.is_available():
...     mps_device = torch.device("mps")
...     x = torch.ones(1, device=mps_device)
...     print (x)
... else:
...     print ("MPS device not found.")
...
tensor([1.], device='mps:0')
>>>

Another test:

╭─    ~ ──────────────────────────────────────────────── ✔  23:36:57  ─╮
╰─ python3                                                                   ─╯
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
2.1.0
>>> print(torch.backends.mps.is_available())
True
>>> print(torch.backends.mps.is_built())
True
>>>
NeedsMoar commented 11 months ago

Multiple reasons. I'm not sure whether you have an M1 Pro or the regular M1 (2020). The M1's combined memory bandwidth to the GPU is lower than what the main system memory had in my 2015 workstation. It can't give the full 16GB to either the CPU or the GPU, and models sometimes need to live in both places, so there are copies... and SDXL is huge. You can do the math on that (see the rough estimate below). Once it starts swapping, everything gets worse. The M1 Pro isn't much better compared to a real video card; current-gen cards have memory bandwidth in excess of 1 TB/s that isn't shared with anything else.
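As a rough back-of-the-envelope sketch of that math (the parameter counts below are approximate public figures for SDXL, not numbers taken from this thread):

# Hedged estimate of SDXL weight memory on a 16GB unified-memory machine.
# Parameter counts are approximate public figures, not measured here.
unet_params     = 2.6e9   # SDXL base UNet, ~2.6B params
text_enc_params = 0.82e9  # CLIP-L + OpenCLIP bigG text encoders, ~0.82B params
vae_params      = 0.08e9  # VAE, ~0.08B params
refiner_params  = 2.3e9   # refiner UNet, only if the refiner is loaded too

for name, bytes_per_param in [("fp32", 4), ("fp16", 2)]:
    base = (unet_params + text_enc_params + vae_params) * bytes_per_param / 1e9
    with_refiner = base + refiner_params * bytes_per_param / 1e9
    print(f"{name}: ~{base:.1f} GB base, ~{with_refiner:.1f} GB with refiner")

# fp32: ~14.0 GB base, ~23.2 GB with refiner -> doesn't fit in 16 GB unified memory
# fp16: ~7.0 GB base, ~11.6 GB with refiner  -> tight once activations are added

The startup log above shows the VAE running in torch.float32 and the VRAM state set to SHARED, which is exactly the situation where the fp32 numbers start to matter.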

Next up, compute: the M1's GPU has a total of about 2.6 TFLOPS of FP32 performance; I think FP16 is double that, and the M1 Pro is a bit under double the M1. For comparison, a 7900 XTX is around 130 TFLOPS of FP16 when fully loaded, and the tensor cores alone in a 4090, running in BF16, can be up to ~127x faster than that chip is likely to be running, since I think these models run in fp32 mode on it. But that's only assuming the GPU has access to the data when it needs it; memory is likely to be the rest of the issue, since a 4090 will run a 25-iteration 1024x1024 SDXL generation in about 4.09 s without any of the recent speed-ups.
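For what it's worth, the ~127x figure is roughly this arithmetic (the 4090 dense BF16 tensor-core throughput used here is an assumed ballpark, not a number from this thread):

# Hedged ballpark only: the M1 FP32 figure is from the comment above;
# the 4090 dense BF16 tensor-core figure (~330 TFLOPS) is an assumption.
m1_fp32_tflops      = 2.6
rtx4090_bf16_tflops = 330.0
print(rtx4090_bf16_tflops / m1_fp32_tflops)  # ~127x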

Unfortunately, the highest-end chips have tons of RAM, but its speed is still a limiter. If you search the web you'll find people getting identical inference speeds on larger models on the M1 Max and M2 Max despite the supposed GPU speed increases, because they're running up against a memory-bandwidth wall. For some reason, memory bandwidth on the lower-end M3 models was cut to below the M2's, so you won't have any luck there either. Apple isn't the place to be if you want to run Stable Diffusion quickly; by the time you buy something with the highest-core-count GPU, you could have bought a dual-socket Epyc workstation, a couple of 4090s, and around a terabyte of RAM.

You can't upgrade the GPU even in the M2 Mac Pros. I've been told that's mainly because they used up all of the PCIe lanes on Thunderbolt, so the PCIe slots are attached through a bridge chip with enough bandwidth for maybe one card... which would be fine, except there's a PCIe coherency flaw in the M2's controller that they were never able to resolve, so hooking up something as high-bandwidth as a graphics accelerator that wants to use DMA would crash the system constantly. They basically built a Mac Pro with a bunch of PCIe slots that can't be used reliably with anything that actually needs to be a PCIe card; I'm guessing they're counting on nobody trying to install 2-channel Mellanox 100GbE cards or anything like that (and they can easily ensure it, since Nvidia would never write drivers for them). There's also no mechanism for a graphics driver other than the integrated GPU's, so discrete GPUs are never going to happen.

About all you can do is try --force-fp16 on the command line and hope somebody releases a quantized int8 version of SDXL soon so it all fits in your shared system/graphics memory; otherwise it'll keep swapping out and stay hopelessly slow. Since you can't add RAM to those machines, there's no luck there either.
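A minimal sketch of that suggestion, combining the flag above with the split cross-attention option already mentioned in the startup log (run from the ComfyUI checkout):

# Force fp16 weights and use the split cross-attention path:
python3 main.py --force-fp16 --use-split-cross-attention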