Open 0x1337ff opened 11 months ago
Multiple reasons. I'm not sure whether you have the M1 Pro model or the regular M1 (2020). The M1's combined memory bandwidth to the GPU is lower than the main system memory bandwidth in my 2015 workstation. Neither side can use the full 16GB, and models sometimes need to live in both places, so there are copies... SDXL is huge; you can do the math on that. It's going to swap out and make things worse. The M1 Pro isn't much better compared to a real video card. Current-gen cards have memory running in excess of 1TB/s that isn't shared with anything else.
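To make "do the math" concrete, here's a back-of-envelope sketch. The ~2.6B parameter count for the SDXL base UNet is a commonly cited approximation, not an exact figure:

```python
# Back-of-envelope: why SDXL strains a 16 GB unified-memory machine.
# The ~2.6B parameter count for the SDXL base UNet is approximate.
GB = 1024 ** 3
unet_params = 2.6e9
bytes_per_param_fp16 = 2

unet_fp16_gb = unet_params * bytes_per_param_fp16 / GB
print(f"SDXL base UNet @ fp16: ~{unet_fp16_gb:.1f} GB")

# Add the text encoders, VAE, activations, the OS itself, and (in the
# refiner workflows below) a second large model, and 16 GB starts swapping.
```

Weights alone come out to roughly 5 GB at fp16, before anything else on the machine is counted.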
Next up, the M1's GPU cores total about 2.6 TFLOPS of FP32 performance; I think FP16 is double that. The M1 Pro is a bit under double the M1. For some kind of comparison, a 7900 XTX is around 130 TFLOPS of FP16 when fully loaded. The tensor cores alone in a 4090, running in BF16, can be up to 127x faster than that chip is likely to be running, since I think those models run in FP32 mode there — but that's only assuming the cores have access to the data when they need it. The memory is likely to be the rest of the issue, since a 4090 will run a 25-iteration 1024x1024 SDXL generation, without any of the recent speedups, in 4.09s.
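Two quick bits of arithmetic on those figures (peak-TFLOPS ratios are very rough and ignore memory and kernel efficiency; 123 TFLOPS is the approximate 7900 XTX FP16 peak, which the post rounds to ~130):

```python
# The quoted 4090 figure: 25 iterations in 4.09 s.
iters, seconds = 25, 4.09
it_per_s = iters / seconds
print(f"4090: ~{it_per_s:.1f} it/s")

# Rough raw-compute ratio (peak TFLOPS only, ignores memory bottlenecks).
m1_fp32_tflops = 2.6    # M1 GPU, FP32
xtx_fp16_tflops = 123   # 7900 XTX FP16 peak, approximate
print(f"7900 XTX FP16 vs M1 FP32: ~{xtx_fp16_tflops / m1_fp32_tflops:.0f}x")
```

So the 4090 is doing about 6 it/s on that workload, while the raw-compute gap between an M1 and a current discrete card is well over an order of magnitude before memory even enters the picture.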
Unfortunately, even the highest-end chips with tons of RAM are still limited by its speed. If you search the web you'll find people getting identical inference speeds on larger models on the M1 Max and M2 Max despite the supposed GPU speed increases, because they're running up against a memory-speed wall. For some reason, memory bandwidth on the lower-end M3 models was cut to below the equivalent M2, so you won't have any luck there either. Apple isn't the place to be if you want to run Stable Diffusion super quickly; by the time you've paid for the model with the highest GPU core count, you could have bought a dual-socket Epyc workstation, a couple of 4090s, and around a terabyte of RAM.
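The memory-wall claim can be sketched as a lower bound: every denoising step has to stream (at least) the full weight set through the GPU once, so bandwidth sets a floor on step time no matter how fast the cores are. The bandwidths below are Apple's published unified-memory specs, and the ~4.8 GB fp16 UNet size is an approximation:

```python
# Bandwidth-bound floor: step_time >= weight_bytes / memory_bandwidth.
weights_gb = 4.8                                           # SDXL UNet @ fp16, approx.
bandwidth_gbs = {"M1": 68.25, "M1 Pro": 200, "M1 Max / M2 Max": 400}

for chip, bw in bandwidth_gbs.items():
    floor_ms = weights_gb / bw * 1000
    print(f"{chip:>15}: >= {floor_ms:.0f} ms/step just streaming weights")
```

Note the M1 Max and M2 Max share the same 400 GB/s figure, which is consistent with people measuring identical inference speeds on the two chips.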
And you can't upgrade GPUs even in the M2 Mac Pros. I've been told this is mainly because they used up all of the PCIe lanes on Thunderbolt, so the PCIe slots in the thing are attached through a bridge chip with enough bandwidth for maybe one card. That would be fine, except there's a PCIe coherency flaw in the M2's controller that they were never able to resolve, so hooking up something as high-bandwidth as a graphics accelerator that wanted to use DMA would crash the system constantly. They basically built a Mac Pro with a bunch of PCIe slots that can't be used reliably by anything that needs to be a PCIe card; I'm guessing they're counting on nobody trying to install dual-port Mellanox 100GbE cards or the like (and they can easily ensure that, since Nvidia would never write drivers for them). There is no mechanism for a graphics driver aside from the integrated GPU's, so it'll never happen.
About all you can do is try --force-fp16 on the command line and hope somebody releases a quantized int8 version of SDXL soon so it all fits in your combined system + graphics memory; otherwise it'll keep swapping out and stay hopelessly slow. Since you can't add RAM to those machines, no luck there either.
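The footprint scaling that flag and a hypothetical int8 release are banking on, using the same approximate ~2.6B-parameter count as above:

```python
# UNet footprint vs. precision (parameter count is approximate).
# int8 would halve the fp16 size, which is why a quantized SDXL
# release would matter on a 16 GB machine.
params = 2.6e9
GB = 1024 ** 3
for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: ~{params * nbytes / GB:.1f} GB")
```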
Hi
I'm trying ComfyUI on my MacBook Pro M1.
With the Load Default workflow and an SD1.5 model (revAnimated_v11.safetensors), generating an image takes maybe 40-50 seconds, which is quick and nice.
But with SDXL... it takes 45 min to 1 h, and I don't understand why :/
I tried 2 models but it's the same...
I'm sharing some screenshots, in case they help.
Here are the 2 templates I tested:
SDXL + Refiner (default).json Workflow SDXL BASE-REFINER-LORA.json
In case it helps, here's what I see in the console:
The time is AMAZING. Why?
I have installed PyTorch (doc for macOS: https://developer.apple.com/metal/pytorch/)
Other test: