Misleading benchmarks? - Githubissues

philipturner commented 1 year ago

The benchmarks only include inference latency, but the actual latency is much larger. For example, they say it takes 18 seconds on the 32c M1 Max, which I have validated. However, there's an additional 22-second latency before that where it says Sampling.... I pulled it up in Activity Monitor, and here's what's happening:

Loading resources and creating pipeline - 2 seconds, because I've already run the model several times
Sampling... - 99% CPU, ~0% GPU, which means one CPU core utilized through this entire step (not multi-core), 22 seconds
Step 50 of 50 [mean: 0.99, median: 1.56, last 1.55] step/sec - ~0% CPU, 88% GPU, which means the actual model is running, 18 seconds
Total time: 40 seconds

Is anyone else getting these weird results? Is your delay the same, or much larger than 22 seconds? I don't know whether it's because I used the Swift CLI instead of the Python CLI. I cannot get the Python CLI to work: https://github.com/apple/ml-stable-diffusion/issues/43#issuecomment-1344970169.

philipturner commented 1 year ago

Variations in execution time based on batch size:

# One run per batch size; each run was timed by hand.
for BATCH_SIZE in 1 2 3 4 5; do
  swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" \
    --resource-path ../mlpackages/Resources --seed 93 --output-path ../outputs \
    --compute-units cpuAndGPU --disable-safety --image-count="$BATCH_SIZE"
done

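| Batch Size | Loading Resources | Sampling | Inference (50 steps) | Total |
| ---------- | ----------------- | -------- | -------------------- | ----- |
| 1 | 4 sec | 17 sec | 16 sec | 37 sec |
| 2 | 4 sec | 20 sec | 32 sec | 56 sec |
| 3 | 4 sec | 20 sec | 49 sec | 73 sec |
| 4 | 4 sec | 20 sec | 66 sec | 90 sec |
| 5 | 4 sec | 21 sec | 81 sec | 106 sec |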

Measured manually with the iPhone Timer app, so results may deviate from actual values by ~2 seconds.

philipturner commented 1 year ago

I guess the benchmarks aren't entirely wrong. The inference time for batched images works out to about 16 seconds per image, probably lower than Apple's 18 seconds because I disabled the NSFW filtering model.

However, Apple should warn users about the ~20 second static overhead. This would be important for people making one-off images where the 40-second feedback loop is their bottleneck, not absolute batched throughput.
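The split between static overhead and per-image cost can be estimated from the batch-size table above by fitting a straight line through (batch size, total seconds). This is a rough sketch that uses only the hand-timed totals from that table; the helper name `fit_line` is illustrative and not part of any tool mentioned here.

```python
# Totals (seconds) from the .cpuAndGPU batch-size table, timed by hand (+/- ~2 s).
batch_sizes = [1, 2, 3, 4, 5]
totals = [37, 56, 73, 90, 106]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

overhead, per_image = fit_line(batch_sizes, totals)
print(f"static overhead ~ {overhead:.1f} s, marginal cost ~ {per_image:.1f} s/image")
```

The fit gives roughly 21 seconds of overhead and 17 seconds per additional image. The per-image figure is a little above 16 seconds because the totals also include the slight growth in sampling time with batch size.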

littleowl commented 1 year ago

Curious what your setting is for the compute units. Try setting it to .all:

@Option(help: "Compute units to load model with {all,cpuOnly,cpuAndGPU,cpuAndNeuralEngine}")
var computeUnits: ComputeUnits = .all

I had not noticed the long sampling step at startup until I changed it to cpuAndGPU; then I could reproduce your findings (referring to the CLI). With the setting at .all, it starts up rather quickly, in just a second or two. The behavior might differ between settings depending on the device and its memory capabilities.

I would imagine, if you were building an application, that you could account for the setup time. The Maple/Native Diffusion implementation has a similar initial startup penalty. I've not fully tested the recommended ANE settings on devices yet, but maybe it can be faster on device with that setup? My guess is that with such large models there is a cost to loading all the weights onto the GPU.

philipturner commented 1 year ago

> referring to the CLI. With setting to .all, it starts up rather quickly just a second or two.

It worked! I had compiled the attention implementation to be GPU-friendly (ORIGINAL), although I did see ANECompilerService compiling something for the neural engine. Perhaps the original sampling pass occurred on the ANE, and the inference pass occurred on the GPU (with 70% utilization).

Latencies (load, sampling, inference): 4 sec, 1 sec, 19 sec. I'll switch back to v1.5 and provide an updated table of latencies, along with performance when optimizing attention for the ANE. Meanwhile, here are the power consumption metrics around the sampling stage with .all:

| Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW) |
| ----- | ------------- | ------- | ------- | ------- |
| Load | -0.3 | 2595 | 35 | 0 |
| Load | -0.2 | 2815 | 0 | 0 |
| Load | -0.1 | 4279 | 18 | 0 |
| Sample | 0.0 | 3685 | 53 | 0 |
| Sample | 0.1 | 2923 | 9 | 0 |
| Sample | 0.2 | 2397 | 9 | 0 |
| Sample | 0.3 | 2447 | 9 | 0 |
| Sample | 0.4 | 2569 | 9 | 0 |
| Sample | 0.5 | 3383 | 1563 | 0 |
| Sample | 0.6 | 3611 | 88 | 283 |
| Sample | 0.7 | 2622 | 5227 | 441 |
| Sample | 0.8 | 1818 | 5393 | 3195 |
| Sample | 0.9 | 1717 | 1903 | 3859 |
| Sample | 1.0 | 2417 | 5144 | 759 |
| Inference | 1.1 | 2531 | 14464 | 573 |
| Inference | 1.2 | 440 | 11207 | 1549 |
| Inference | 1.3 | 217 | 1224 | 4255 |
| Inference | 1.4 | 508 | 13588 | 2359 |
| Inference | 1.5 | 1439 | 18315 | 1324 |

Sampling is too quick to 100% prove whether it's actually utilizing the ANE, or just late to report that it started inferencing.

And here are the metrics with .cpuAndGPU (~36 watts during inference):

| Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW) |
| ----- | ------------- | ------- | ------- | ------- |
| Sample | -0.5 | 1428 | 9 | 0 |
| Sample | -0.4 | 1416 | 18 | 0 |
| Sample | -0.3 | 2390 | 26 | 0 |
| Sample | -0.2 | 1650 | 14378 | 0 |
| Sample | -0.1 | 1982 | 40245 | 0 |
| Inference | 0 | 1156 | 35068 | 0 |
| Inference | 0.1 | 1105 | 33366 | 0 |
| Inference | 0.2 | 839 | 40909 | 0 |
| Inference | 0.3 | 1343 | 27957 | 0 |
| Inference | 0.4 | 503 | 38959 | 0 |
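As a sanity check on the ~36 W figure, the five inference samples in the .cpuAndGPU table can be averaged. This is a minimal sketch; the values are copied from that table, and the names are illustrative.

```python
# (CPU mW, GPU mW, ANE mW) for the five "Inference" rows of the .cpuAndGPU table.
inference_samples = [
    (1156, 35068, 0),
    (1105, 33366, 0),
    (839, 40909, 0),
    (1343, 27957, 0),
    (503, 38959, 0),
]

def mean_total_watts(samples):
    """Average of per-sample CPU + GPU + ANE power, converted from mW to W."""
    totals_mw = [sum(sample) for sample in samples]
    return sum(totals_mw) / len(totals_mw) / 1000

print(f"mean CPU+GPU+ANE power during inference: {mean_total_watts(inference_samples):.1f} W")
```

Five samples at 0.1 s spacing is a very short window, so this only shows that the quoted figure is consistent with the table, not that it holds over the whole run.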

philipturner commented 1 year ago

Note that if you try to re-run the command for generating a CoreML model, it will actually silently fail. You have to purge the mlpackages directory. I did not know this when switching between SPLIT_EINSUM and ORIGINAL previously.

# One run per batch size; each run was timed by hand.
for BATCH_SIZE in 1 2 3 4 5; do
  swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" \
    --resource-path ../mlpackages/Resources --seed 93 --output-path ../outputs \
    --compute-units all --disable-safety --image-count="$BATCH_SIZE"
done

With attention set to ORIGINAL (~15 watts during inference):

| Batch Size | Loading Resources (s) | Sampling (s) | Inference (50 steps, s) | Total (s) |
| ---------- | --------------------- | ------------ | ----------------------- | --------- |
| 1 | 3 | 1 | 19 | 24 |
| 2 | 3 | 2 | 39 | 44 |
| 3 | 4 | 2 | 60 | 65 |
| 4 | 3 | 3 | 79 | 85 |
| 5 | 3 | 3 | - | - |
| 10 | 3 | 6 | - | - |
| 20 | 3 | 9 | - | - |
| 40 | 3 | 17 | - | - |

This seems to have marginally slower batched throughput (20 sec vs 16 sec), but about half the power consumption (15 W vs 36 W). Overall, it seems better than .cpuAndGPU. The GPU:ANE performance ratio stays the same on M1 Ultra, so these should be the best settings on all Apple silicon Macs.

With attention set to SPLIT_EINSUM (~13 watts during inference):

| Batch Size | Loading Resources (s) | Sampling (s) | Inference (50 steps, s) | Total (s) |
| ---------- | --------------------- | ------------ | ----------------------- | --------- |
| 1 | 5 | 1 | 22 | 29 |
| 2 | 5 | 2 | 45 | 52 |
| 3 | 5 | 2 | 68 | 75 |

With attention set to SPLIT_EINSUM and only .cpuAndNeuralEngine (~3 watts during inference):

| Batch Size | Loading Resources (s) | Sampling (s) | Inference (50 steps, s) | Total (s) |
| ---------- | --------------------- | ------------ | ----------------------- | --------- |
| 1 | 4 | 1 | 39 | 44 |
| 2 | 4 | 2 | 77 | 83 |
| 3 | 4 | 3 | 116 | 122 |

philipturner commented 1 year ago

I've predicted the likely (actual) fastest implementation on each M1 model, and adjusted the numbers to match CLI latencies.

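| Device | `--compute-unit` | `--attention-implementation` | Latency (s): Apple's benchmark -> adjusted for CLI |
| ------ | ---------------- | ---------------------------- | -------------------------------------------------- |
| Mac Studio (M1 Ultra, 64-core GPU) | `ALL` | `ORIGINAL` | 9 -> 14 |
| Mac Studio (M1 Ultra, 48-core GPU) | `ALL` | `ORIGINAL` | 13 -> 18 |
| MacBook Pro (M1 Max, 32-core GPU) | `ALL` | `ORIGINAL` | 18 -> 24 |
| MacBook Pro (M1 Max, 24-core GPU) | `ALL` | `ORIGINAL` | 20 -> 26 |
| MacBook Pro (M1 Pro, 16-core GPU) | `ALL` | `SPLIT_EINSUM` | 26 -> 30 |
| MacBook Pro (M1) | `CPU_AND_NE` | `SPLIT_EINSUM` | 35 -> 39 |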

Regarding battery life on M1 Max, there's a tradeoff between latency and power efficiency. You may want to use the neural engine when on battery. I assumed 3 W during loading and sampling, except for sampling with .cpuAndGPU, where I assumed 1.5 W.

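| Compute Units | Attention | Runtime (s) | Energy (J) | Inferences/Charge | Battery Life |
| ------------- | --------- | ----------- | ---------- | ----------------- | ------------ |
| `.cpuAndGPU` | `ORIGINAL` | 37 | 614 | ~420 | 4 hours |
| `.all` | `ORIGINAL` | 24 | 297 | ~870 | 6 hours |
| `.all` | `SPLIT_EINSUM` | 29 | 304 | ~850 | 7 hours |
| `.cpuAndNeuralEngine` | `SPLIT_EINSUM` | 44 | 132 | ~1960 | 24 hours |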

Assuming a 100 watt-hour battery at 90% health, or 324,000 joules. The battery will be drained from 90% to 10%, a typical real-world scenario, so about 259,200 joules are available per charge.
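The battery table can be reproduced from the batch-size-1 stage timings and the power figures quoted earlier in the thread. The sketch below recomputes it under the stated assumptions (3 W during loading and sampling, 1.5 W during sampling with .cpuAndGPU, and 80% of a 324,000 J battery). The data layout and the `budget` helper are illustrative, not taken from the original post.

```python
USABLE_JOULES = 324_000 * 0.8  # 100 Wh at 90% health, drained from 90% to 10%

# name -> (measured total runtime in s, [(stage seconds, assumed watts), ...])
# Stages are load, sampling, inference for a single image.
CONFIGS = {
    ".cpuAndGPU / ORIGINAL": (37, [(4, 3.0), (17, 1.5), (16, 36.0)]),
    ".all / ORIGINAL": (24, [(3, 3.0), (1, 3.0), (19, 15.0)]),
    ".all / SPLIT_EINSUM": (29, [(5, 3.0), (1, 3.0), (22, 13.0)]),
    ".cpuAndNeuralEngine / SPLIT_EINSUM": (44, [(4, 3.0), (1, 3.0), (39, 3.0)]),
}

def budget(runtime_s, stages):
    """Return (joules per image, images per charge, hours of back-to-back use)."""
    joules = sum(seconds * watts for seconds, watts in stages)
    images = USABLE_JOULES / joules
    hours = images * runtime_s / 3600
    return joules, images, hours

for name, (runtime_s, stages) in CONFIGS.items():
    joules, images, hours = budget(runtime_s, stages)
    print(f"{name}: {joules:.0f} J, ~{images:.0f} images/charge, {hours:.1f} h")
```

The results agree with the table after rounding. Battery life here means generating single images back to back, using the measured total runtime rather than the sum of the stage times, which differs by about a second in two rows because of manual timing.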

hirakujira commented 1 year ago

The benchmark is still misleading. They said they could generate an image on an M1 Ultra with a 48-core GPU within 13 seconds. And they didn't even use the Swift package or the neural engine!

> The executed program is python_coreml_stable_diffusion.pipeline for macOS devices and a minimal Swift test app built on the StableDiffusion Swift package for iOS and iPadOS devices.

rovo79 commented 1 year ago

> Stage Timestamp (s) CPU (mW) GPU (mW) ANE (mW)

How do you obtain these detailed mW readings of the process running?

philipturner commented 1 year ago

sudo powermetrics --sample-rate 100
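The plain-text samples from powermetrics can be turned into rows like the power tables earlier in the thread with a small script. The sketch below assumes each sample contains lines of the form `CPU Power: 2595 mW`, `GPU Power: 35 mW` and `ANE Power: 0 mW`. That line format is an assumption about the tool's output on Apple silicon and may differ between macOS versions. The example text is made up from the first two rows of the .all table, and `parse_samples` is a hypothetical helper.

```python
import re

# Assumed line format; lines such as "Combined Power (...)" are ignored.
POWER_LINE = re.compile(r"^(CPU|GPU|ANE) Power:\s*(\d+)\s*mW", re.MULTILINE)

def parse_samples(text):
    """Group CPU/GPU/ANE power lines into one dict per sample.

    A new sample starts whenever a "CPU Power" line is seen.
    """
    samples = []
    for unit, value in POWER_LINE.findall(text):
        if unit == "CPU" or not samples:
            samples.append({})
        samples[-1][unit] = int(value)
    return samples

example_output = """\
CPU Power: 2595 mW
GPU Power: 35 mW
ANE Power: 0 mW
Combined Power (CPU + GPU + ANE): 2630 mW
CPU Power: 2815 mW
GPU Power: 0 mW
ANE Power: 0 mW
"""

for row in parse_samples(example_output):
    print(row)
```

To use it on real data, the command's output would first be saved to a file and read back as text. Timestamps are not parsed here, so the stage labels in the tables would still need to be added by hand.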

rovo79 commented 1 year ago

Wow! I recall trying to use something like that a couple of years ago, but it didn't seem to exist on Macs. Was this recently re-added...

On Dec 20, 2022, at 8:33 AM, Philip Turner @.***> wrote:

powermetrics

apple / ml-stable-diffusion

Misleading benchmarks? #54