apple / ml-stable-diffusion

Stable Diffusion with Core ML on Apple Silicon

Memory Issues on 4 GB iOS/iPadOS devices #291

Open MenKuch opened 10 months ago

MenKuch commented 10 months ago

Hello!

I am using a stable-diffusion-1.5-derived model (https://civitai.com/models/4384/dreamshaper) converted to Core ML (using the --quantize-nbits 6 option) to be used on Macs, iPhones, and iPads.

It works great on Macs, and on iPads and iPhones with >= 6 GB RAM, on either the ANE or CPU+GPU. I understand that due to memory constraints, the only option to use this stable-diffusion model on devices with 4 GB RAM (like the iPhone 12, for example) is to use the ANE since CPU+GPU would require more RAM.

I ran into several issues trying to run inference with the converted model on a 4 GB iPhone with iOS 17.0.3 using the ANE. My own process uses around 100 MB of memory while other processes on the iPhone 12 use 2.5 to 3 GB during inference. Most of the time, this leads to:

Twice I was lucky enough to see the result of the inference, but most of the time it does not work.

On https://github.com/apple/ml-stable-diffusion , it is stated that iOS 17 brings many optimisations, including just-in-time decompression, which should greatly reduce the memory footprint. Since this stable-diffusion-1.5-derived model is running into memory constraints, I am guessing that this optimisation is not enabled or not available. The "reduceMemory" option in the StableDiffusionPipeline did not make any difference in my tests.
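For reference, this is roughly how I set things up on the 4 GB devices. It is a minimal sketch assuming the StableDiffusionPipeline API from this repo's Swift package; the helper name, resourceURL, prompt, and step count are placeholders of mine, and the exact initializer parameters may differ between package versions:

```swift
import CoreGraphics
import CoreML
import Foundation
import StableDiffusion

// Sketch: generate a single image with ANE-only compute units and reduceMemory enabled.
func generateOneImage(resourcesAt resourceURL: URL, prompt: String) throws -> CGImage? {
    // Keep the models on the Neural Engine; cpuAndGPU needs noticeably more RAM.
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    // reduceMemory: true loads and unloads the individual models around each stage
    // instead of keeping everything resident at once.
    let pipeline = try StableDiffusionPipeline(
        resourcesAt: resourceURL,   // folder containing the compiled .mlmodelc files
        configuration: config,
        reduceMemory: true
    )
    try pipeline.loadResources()

    var generation = StableDiffusionPipeline.Configuration(prompt: prompt)
    generation.imageCount = 1
    generation.stepCount = 20
    generation.seed = 42

    let images = try pipeline.generateImages(configuration: generation) { _ in
        true   // return false from the progress handler to cancel
    }
    return images.first ?? nil
}
```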

Any hints/insights on how to stay below the memory limit on 4GB iOS/iPadOS devices?

atiorh commented 10 months ago

The "reduceMemory"-option in the stable-diffusion-pipeline did not make any difference in my tests.

Do you mean it doesn't reduce the incidence rate of the issues or it doesn't change peak memory consumption at all? If the latter is the case, there is something wrong. It should make a huge difference in the peak memory consumption. The former is a plausible outcome.

> On https://github.com/apple/ml-stable-diffusion , it is stated that iOS 17 brings many optimisations, including just-in-time decompression, which should greatly reduce the memory footprint. Since this stable-diffusion-1.5-derived model is running into memory constraints, I am guessing that this optimisation is not enabled or not available.

This model with --quantize-nbits 6 is expected to benefit from the iOS 17 improvements. v1-5 uses slightly more memory than v2-1, so I recommend looking at the Advanced Weight Compression section in the README to go below 6 bits.

> is to use the ANE since CPU+GPU would require more RAM.

I wouldn't always assume this; it would be useful to test cpuAndGPU as well, just to observe the latency and memory usage differences compared with cpuAndNeuralEngine.
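Something along these lines would do as a rough comparison. This is only a sketch (the helper name, resourceURL, prompt, and step count are placeholders, and it assumes the same pipeline API as in the sketch above); memory can be watched in parallel with the Xcode memory gauge or Instruments:

```swift
import CoreML
import Foundation
import StableDiffusion

// Sketch: generate one image per compute-unit option and print the latency.
func compareComputeUnits(resourcesAt resourceURL: URL) throws {
    let options: [(name: String, units: MLComputeUnits)] = [
        ("cpuAndNeuralEngine", .cpuAndNeuralEngine),
        ("cpuAndGPU", .cpuAndGPU),
    ]
    for option in options {
        let config = MLModelConfiguration()
        config.computeUnits = option.units

        let pipeline = try StableDiffusionPipeline(
            resourcesAt: resourceURL,
            configuration: config,
            reduceMemory: true
        )
        try pipeline.loadResources()

        var generation = StableDiffusionPipeline.Configuration(prompt: "a red bicycle")
        generation.stepCount = 20

        let start = CFAbsoluteTimeGetCurrent()
        _ = try pipeline.generateImages(configuration: generation) { _ in true }
        print("\(option.name): \(CFAbsoluteTimeGetCurrent() - start) s")

        // Release everything before trying the next option (unloadResources exists in
        // recent package versions; otherwise just let the pipeline go out of scope).
        pipeline.unloadResources()
    }
}
```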

MenKuch commented 10 months ago

Thanks for your response, highly appreciated.

I dug a little deeper and here are my findings on an iPhone 12 mini with iOS 17.0.3/17.1 regarding reduceMemory etc:

Using my own 6-bit stable-diffusion-1.5-derived model:

Using Apple-provided SD 1.5 6-bit-palettized split_einsum model (https://huggingface.co/apple/coreml-stable-diffusion-v1-5-palettized):

The only thing that currently works mostly reliably on 4 GB A14 hardware is:

Since my test app does not do much besides calling the provided Swift classes, is it safe to assume that the Apple-provided SD 1.5 models + the Swift classes are not suitable in their current form on A14 with 4 GB? Is there anything else I could try or investigate besides reducing the weight precision as described in "Advanced Weight Compression"?

atiorh commented 10 months ago

> Thanks for your response, highly appreciated.

Of course!

> SD 1.5 models + the Swift classes are not suitable in their current form on A14 with 4 GB?

The only benchmarks involving the iPhone 12 mini in the README were demonstrated on SD v2.1, so it could be the case that SD v1.5's slightly higher memory requirement is going over the system limits on a 4 GB device under additional system load. That being said, I would still introspect the app and the particular device for any extra external memory usage.
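For the app side of that introspection, something like the following reports the process's own physical footprint (close to what the Xcode memory gauge shows). This is a generic task_info-based sketch, not an API of this repo, and the helper name is mine:

```swift
import Darwin

/// Returns the app's current physical memory footprint in bytes, or nil on failure.
func currentMemoryFootprint() -> UInt64? {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<natural_t>.size
    )
    let kr = withUnsafeMutablePointer(to: &info) { infoPtr in
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
        }
    }
    guard kr == KERN_SUCCESS else { return nil }
    return info.phys_footprint
}

// Example: log the footprint before and after each pipeline stage.
if let bytes = currentMemoryFootprint() {
    print("footprint: \(Double(bytes) / 1_048_576) MB")
}
```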

> Important: Avoid calling prewarmResources in the Swift sample code provided here (loadResources will be called before accessing the models). This is responsible for the HUGE memory spike in other iOS processes when generating a second image.

This is super fishy: the prewarmResources call is only supposed to incur "extra" memory spikes from Neural Engine compilation during the initial load (before the first image), because the resulting compiled assets are cached for later use. The fact that you are getting spikes on the second load makes me think that your device is running low on storage and the caches have to be purged. Can you confirm that the "time to see the first UNet step" for your second image is roughly similar to that of the first image?
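One way to measure that is to timestamp the first denoising step from the progress handler, as in the sketch below. It assumes the generateImages(configuration:progressHandler:) API and a Progress value that exposes the current step; the helper name, prompt, and step count are placeholders, and field names may differ slightly between package versions:

```swift
import CoreML
import Foundation
import StableDiffusion

// Sketch: time from the start of generation until the first UNet step reports progress.
// A much larger value for image 2 than for image 1 would suggest the Neural Engine
// assets are being recompiled instead of served from the cache.
func timeToFirstStep(pipeline: StableDiffusionPipeline, prompt: String) throws -> TimeInterval {
    var generation = StableDiffusionPipeline.Configuration(prompt: prompt)
    generation.stepCount = 20

    let start = CFAbsoluteTimeGetCurrent()
    var firstStep: TimeInterval?
    _ = try pipeline.generateImages(configuration: generation) { progress in
        if firstStep == nil, progress.step >= 1 {
            firstStep = CFAbsoluteTimeGetCurrent() - start
        }
        return true
    }
    return firstStep ?? (CFAbsoluteTimeGetCurrent() - start)
}

// Usage idea: call this twice on the device and compare the two values.
```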

MenKuch commented 10 months ago

Again, thanks for your awesome support.

We tried it on two iPhone 12 models (regular and mini) with iOS 17.0.3 and iOS 17.1. Both devices have plenty of free space (>30 GB) and I even erased one of the iPhone 12s before trying again – same results.

I think I found a new clue: the safety checker model. It seems that compiling the safety checker model for the ANE fails ("E5RT encountered an STL exception. msg = MILCompilerForANE error: failed to compile ANE model using ANEF. Error=_ANECompiler : ANECCompile() FAILED."), and somehow this is responsible for the huge memory spike I am seeing in prewarmResources when generating a second image. I still do not have all the pieces of the puzzle (and I see contradictory run-to-run variations), but deactivating the safety checker OR skipping prewarmResources keeps me below the memory ceiling.
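For anyone else who runs into this, deactivating the safety checker looks roughly like this in my setup. A sketch only: it assumes the package version exposes a disableSafety flag on the initializer (alternatively, simply don't bundle SafetyChecker.mlmodelc in the resources folder, i.e. convert without --convert-safety-checker), and the helper name and resourceURL are placeholders:

```swift
import CoreML
import Foundation
import StableDiffusion

// Sketch: build the pipeline without the safety checker and without prewarming.
func makePipelineWithoutSafetyChecker(resourcesAt resourceURL: URL) throws -> StableDiffusionPipeline {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    let pipeline = try StableDiffusionPipeline(
        resourcesAt: resourceURL,
        configuration: config,
        disableSafety: true,   // assumption: this flag is present in the initializer of this package version
        reduceMemory: true
    )
    try pipeline.loadResources()   // note: prewarmResources() is deliberately not called
    return pipeline
}
```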

Of course, I am not happy with this situation on 4 GB devices, so I think I will try to convert my models to below-6-bit weights as described in "Advanced Weight Compression" to save more memory.

On >= 6 GB devices (and especially on M1 & M2 iPads and Macs), everything is working fabulously. Thanks for your outstanding work.