[Performance] A17 Pro ANE has much more computing power than M2, but its Stable Diffusion performance is worse than M2?

apple / ml-stable-diffusion

Stable Diffusion with Core ML on Apple Silicon

MIT License


[Performance] A17 Pro ANE has much more computing power than M2, but its Stable Diffusion performance is worse than M2? #283

Open AndreaChiChengdu opened 1 year ago

AndreaChiChengdu commented 1 year ago

Hello. As the title says, and as the snapshot of the Stable Diffusion XL benchmark in this project shows, the A17 Pro's 35 TOPS ANE performs worse than the M2's 15.8 TOPS ANE. Is there any reason for this besides the large memory bandwidth gap?

It seems that the A17 Pro's 35 TOPS of compute is not being used effectively at all.

AndreaChiChengdu commented 1 year ago

As a supplement, I used the Diffusers app to run SD 1.5 inference on the ANE of my iPhone 15 Pro (A17 Pro), and found that end-to-end times were about the same as on M2. The A17 Pro also shows very limited improvement over the A16's 17 TOPS ANE (which has the same memory bandwidth). Is this because the workload is memory-bound?

I am very confused. How can I find the answer to this question?
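One rough way to reason about the memory-bound question is a roofline estimate: attainable throughput is the smaller of the peak compute and the memory bandwidth multiplied by the arithmetic intensity (operations per byte moved). The sketch below uses the peak TOPS figures quoted in this thread. The bandwidth values (51.2 GB/s for A17 Pro, 100 GB/s for M2) and the arithmetic intensities are assumptions added for illustration, not measurements of the UNet.

```python
def attainable_tops(peak_tops, bandwidth_gb_s, ops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    # GB/s * ops/byte = Gops/s; divide by 1000 to express it in Tops/s.
    memory_roof_tops = bandwidth_gb_s * ops_per_byte / 1000.0
    return min(peak_tops, memory_roof_tops)


def ridge_point(peak_tops, bandwidth_gb_s):
    """Arithmetic intensity (ops/byte) above which a chip becomes compute-bound."""
    return peak_tops * 1000.0 / bandwidth_gb_s


# Peak TOPS are the numbers quoted in the thread.
# Bandwidths are ASSUMED values for illustration only.
CHIPS = {
    "A17 Pro": {"peak_tops": 35.0, "bandwidth_gb_s": 51.2},
    "M2": {"peak_tops": 15.8, "bandwidth_gb_s": 100.0},
}

if __name__ == "__main__":
    # Hypothetical arithmetic intensities, not derived from the real model.
    for ops_per_byte in (50, 100, 200, 1000):
        for name, chip in CHIPS.items():
            bound = attainable_tops(
                chip["peak_tops"], chip["bandwidth_gb_s"], ops_per_byte
            )
            print(f"{name:8s} intensity={ops_per_byte:5d} ops/byte -> {bound:.2f} TOPS")
```

Under these assumed numbers, the chip with the higher peak only wins once the workload's arithmetic intensity is above the other chip's memory roof. Below that, the chip with more bandwidth comes out ahead regardless of peak TOPS, which would be consistent with a memory-bound explanation but does not prove it.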

TimYao18 commented 1 year ago

https://www.cpu-monkey.com/en/compare_cpu-apple_a17_pro-vs-apple_m2_pro_12_cpu_19_gpu I think you might want to consider their cores.

AndreaChiChengdu commented 1 year ago

> https://www.cpu-monkey.com/en/compare_cpu-apple_a17_pro-vs-apple_m2_pro_12_cpu_19_gpu I think you might want to consider their cores.

The UNet runs on the ANE. As can be seen from the specifications, both the A17 Pro and M2 ANEs have 16 cores, and the A17 Pro's is much more powerful on paper (35 TOPS vs. 15.8 TOPS), yet it performs worse. It's hard to believe. Any suggestions? @TimYao18 @pcuenca

TimYao18 commented 1 year ago

You cannot look at just the "ANE" part. The compute unit is "CPU and NE". Maybe the CPU part adds to the M2's score. Or maybe Apple just messed up.

When using CPU + ANE, the CPU will also use a lot of power.

AndreaChiChengdu commented 1 year ago

> You cannot look at just the "ANE" part. The compute unit is "CPU and NE". Maybe the CPU part adds to the M2's score. Or maybe Apple just messed up.
>
> When using CPU + ANE, the CPU will also use a lot of power.

The CPU will always consume some power; that's not the point.

I encourage you to use the Core ML template in Instruments for further analysis. You will see that almost all of the UNet operators (99.89%) are executed on the ANE. The CPU has only a very small workload, and its latency is also very small compared to the ANE computation.

Breaking the timings down at a finer granularity, our finding is that the UNet's ANE computation time on the A17 Pro is already slightly slower than on M2. Anyway, thanks for your reply, buddy. Have a great weekend~

AndreaChiChengdu commented 1 year ago

[Image: Screenshot 2023-11-10 18 24 10]

TimYao18 commented 1 year ago

Thank you for your information.

I ran into a similar problem with the M2 Pro and M2: the M2 Pro runs slower than the M2, and computeUnit == ALL runs twice as slow as CPU_AND_NE. Maybe I can use this to check whether something is wrong with the M2 Pro that makes it slower than the M2 when running the UNet.
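For comparing compute-unit settings such as ALL and CPU_AND_NE outside of Instruments, a small timing harness can help keep the measurement consistent. The following is a minimal sketch, not the benchmark used by this project: `benchmark` and `compare` are hypothetical helper names, and the commented-out coremltools wiring uses a placeholder model path and input dictionary.

```python
import statistics
import time


def benchmark(step, warmup=2, runs=10):
    """Return the median wall-clock latency of step() in seconds."""
    for _ in range(warmup):
        step()  # early calls may include one-time setup cost; exclude them
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(configs, warmup=2, runs=10):
    """configs maps a label to a zero-argument callable.

    Returns a dict of label -> median latency in seconds.
    """
    return {label: benchmark(step, warmup, runs) for label, step in configs.items()}


# Hypothetical wiring with coremltools (path and inputs are placeholders):
#
#   import coremltools as ct
#   ne = ct.models.MLModel("Unet.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_NE)
#   everything = ct.models.MLModel("Unet.mlpackage", compute_units=ct.ComputeUnit.ALL)
#   results = compare({
#       "CPU_AND_NE": lambda: ne.predict(sample_inputs),
#       "ALL": lambda: everything.predict(sample_inputs),
#   })

if __name__ == "__main__":
    # Stand-in workloads so the sketch runs without a Core ML model.
    demo = compare(
        {"fast": lambda: sum(range(1_000)), "slow": lambda: time.sleep(0.01)},
        warmup=1,
        runs=5,
    )
    for label, seconds in demo.items():
        print(f"{label}: {seconds * 1000:.3f} ms")
```

Reporting the median rather than the mean makes the result less sensitive to an occasional slow run, and the warm-up iterations keep first-call overhead out of the samples.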