apple / ml-stable-diffusion

Stable Diffusion with Core ML on Apple Silicon
MIT License

[Swift CLI] Optional image width and height? #64

Open ju-popov opened 1 year ago

ju-popov commented 1 year ago

Will there be support for optional image width and height?

ju-popov commented 1 year ago

I tried to convert the models with --latent-w 64 --latent-h 96, but when saving the image after generation I get an error:

Building for debugging...
Build complete! (0.07s)
Loading resources and creating pipeline
(Note: This can take a while the first time using these resources)
Step 125 of 125  [mean: 1.75, median: 1.75, last 1.73] step/sec
2022-12-15 08:18:12.288 StableDiffusionSample[59344:2534586] -[NSNull featureNames]: unrecognized selector sent to instance 0x1f28556e8
2022-12-15 08:18:12.289 StableDiffusionSample[59344:2534586] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[NSNull featureNames]: unrecognized selector sent to instance 0x1f28556e8'
*** First throw call stack:
(
    0   CoreFoundation                      0x000000019679f3f8 __exceptionPreprocess + 176
    1   libobjc.A.dylib                     0x00000001962eaea8 objc_exception_throw + 60
    2   CoreFoundation                      0x0000000196841c1c -[NSObject(NSObject) __retain_OA] + 0
    3   CoreFoundation                      0x0000000196705670 ___forwarding___ + 1600
    4   CoreFoundation                      0x0000000196704f70 _CF_forwarding_prep_0 + 96
    5   StableDiffusionSample               0x00000001002f8de4 $s15StableDiffusion7DecoderV6decodeySaySo10CGImageRefaGSay6CoreML13MLShapedArrayVySfGGKFAFSiXEfU1_ + 200
    6   StableDiffusionSample               0x00000001002fa54c $s15StableDiffusion7DecoderV6decodeySaySo10CGImageRefaGSay6CoreML13MLShapedArrayVySfGGKFAFSiXEfU1_TA + 28
    7   libswiftCore.dylib                  0x00000001a43e0038 $sSlsE3mapySayqd__Gqd__7ElementQzKXEKlF + 656
    8   StableDiffusionSample               0x00000001002f83a0 $s15StableDiffusion7DecoderV6decodeySaySo10CGImageRefaGSay6CoreML13MLShapedArrayVySfGGKF + 792
    9   StableDiffusionSample               0x000000010031290c $s15StableDiffusion0aB8PipelineV14decodeToImages_13disableSafetySaySo10CGImageRefaSgGSay6CoreML13MLShapedArrayVySfGG_SbtKF + 104
    10  StableDiffusionSample               0x00000001003116f8 $s15StableDiffusion0aB8PipelineV14generateImages6prompt10imageCount04stepH04seed13disableSafety9scheduler15progressHandlerSaySo10CGImageRefaSgGSS_S3iSbAA0aB9SchedulerOSbAC8ProgressVXEtKF + 3476
    11  StableDiffusionSample               0x00000001003231e0 $s18StableDiffusionCLI0aB6SampleV3runyyKF + 1564
    12  StableDiffusionSample               0x0000000100328374 $s18StableDiffusionCLI0aB6SampleV14ArgumentParser15ParsableCommandAadEP3runyyKFTW + 16
    13  StableDiffusionSample               0x00000001002722f0 $s14ArgumentParser15ParsableCommandPAAE4mainyySaySSGSgFZ + 436
    14  StableDiffusionSample               0x000000010027283c $s14ArgumentParser15ParsableCommandPAAE4mainyyFZ + 52
    15  StableDiffusionSample               0x000000010031fd80 StableDiffusionCLI_main + 100
    16  dyld                                0x000000019631be50 start + 2544
)
libc++abi: terminating with uncaught exception of type NSException
2022/12/15 08:18:12 signal: abort trap

Maybe this is not the right approach?
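
One way to check whether the requested latent size actually made it into a converted package is to inspect its spec with coremltools; here is a minimal sketch (the .mlpackage path below is a placeholder):

import coremltools as ct

# Placeholder path: point this at one of the converted packages, e.g. the VAE decoder.
# skip_model_load avoids compiling the model just to read its description.
decoder = ct.models.MLModel(
    "output-mlpackages/Stable_Diffusion_vae_decoder.mlpackage",
    skip_model_load=True,
)
for inp in decoder.get_spec().description.input:
    # For --latent-h 96 --latent-w 64 the latent input should show up as (1, 4, 96, 64).
    print(inp.name, inp.type.multiArrayType.shape)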

littleowl commented 1 year ago

The current python code to convert the UNET does not reference or use the --latent-w 64 --latent-h 96 parameters, but you can hard code them in increments of 64 + 8 * n. However, in my experience, the decoder will save to the ml package but then fail the test in the python script. I've only tested the ORIGINAL attention implementation so far, and I got a different crash when computeUnits was set to .all. That kinda makes sense, because I think SPLIT_EINSUM is needed for the ANE, and the crash I was seeing was referencing the ANE. I'll redo it with SPLIT_EINSUM and .all to see whether that solves it. After setting var computeUnits: ComputeUnits = .cpuAndGPU I get the correct output (with a headless horse, lol):

[Image: a_photo_of_an_astronaut_riding_a_horse_on_mars 68962474 final]

I've found other issues when trying to add flexible input shapes to the models, but I will have to open a separate issue for that...
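
For anyone testing the converted packages from Python, the same ANE-avoiding restriction can be expressed when loading a model with coremltools; a minimal sketch, with a placeholder package path:

import coremltools as ct

# Placeholder path: point this at the UNet package produced by torch2coreml.
# compute_units=CPU_AND_GPU mirrors the Swift-side `computeUnits = .cpuAndGPU`
# workaround, keeping inference off the ANE.
unet = ct.models.MLModel(
    "output-mlpackages/Stable_Diffusion_unet.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
print(unet.get_spec().description.input)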

ju-popov commented 1 year ago

Yes! Managed to do it with SPLIT_EINSUM and cpuAndGPU.

Still looking for a way to use dynamic sizes from Swift (because the python code loads the models too slowly).

godly-devotion commented 1 year ago

@littleowl I was curious about supporting generation of images with custom width and height as well. What other issues have you noticed when trying to add flexible input shapes to the models? If it wasn't too difficult to do, I wanted to add those options to the frontend I was working on here.

littleowl commented 1 year ago

@godly-devotion I created two other issues related to this: "Cannot create CoreML model with Flexible input shapes" and "SPLIT_EINSUM - Kernel Panic when testing UNET created with height 96 and width 64 and SPLIT_EINSUM".

I get errors when implementing the coremltools flexible input shapes. Possibly the next step would be to create an issue with coremltools or ask for help on the developer forums.

When using maple-diffusion, I was easily able to modify the width and height at model initialization time after some simple modifications to the code. I am not sure if it is possible to create the model with one aspect ratio and then change the aspect ratio without re-initializing the model, though that is obviously the goal. This might also come down to the difference between the old CoreML Neural Network format and the newer ML Program: from my reading, the old format is a bit more dynamic, but I think we need the new one for the ANE. The weights of two models generated with different aspect ratios are exactly the same, so distributing models that differ only in aspect ratio would be a waste (maybe fine for personal use). My GUESS is that it is the architecture part of the model that differs. Theoretically, it MIGHT be possible to switch out the architecture and compile the .mlmodelc on device: the generated model packages are just folders with a few files, one for weights, one for the architecture, and another for metadata, I think. One could either swap out the architecture portion of the package or use the open-source protobuf definitions to poke at it. But without a flexible model architecture you end up wasting gigabytes of space on the user's device for duplicated weights.
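
For reference, this is roughly the shape of the flexible-shapes change that keeps failing for me; a minimal sketch, assuming traced_unet is the torch.jit-traced UNet used by the conversion script, and treating the extra input names and layouts here as assumptions for illustration rather than verified values:

import coremltools as ct

# Enumerated (flexible) latent shapes for the UNet conversion.
# Batch of 2 matches the classifier-free-guidance batching in the conversion script.
latent_shapes = ct.EnumeratedShapes(
    shapes=[(2, 4, 64, 64), (2, 4, 96, 64)],  # 512x512 and 512x768 latents
    default=(2, 4, 64, 64),
)

coreml_unet = ct.convert(
    traced_unet,                  # assumed: traced PyTorch UNet from torch2coreml
    convert_to="mlprogram",       # ML Program is the format used for the ANE-friendly models
    inputs=[
        ct.TensorType(name="sample", shape=latent_shapes),
        ct.TensorType(name="timestep", shape=(2,)),
        ct.TensorType(name="encoder_hidden_states", shape=(2, 768, 1, 77)),
    ],
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)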

godly-devotion commented 1 year ago

@littleowl Gotcha. Looks like we need more info on the new ANE-compatible models. I was able to modify and build 64w x 96h (after some crazy swap action on my 32GB M1 Pro) but was having trouble figuring out how to pass the width and height to the CLI/library. What did you need to modify in order to pass the latent values?

hubin858130 commented 1 year ago

Hello~~ Could you share the maple-diffusion code where you modified the width and height?

godly-devotion commented 1 year ago

Figured out how to do it; I've updated the wiki here.

Here is the patch for modifying the torch2coreml.py script to create a model that will generate images in 512x768.

diff --git a/python_coreml_stable_diffusion/torch2coreml.py b/python_coreml_stable_diffusion/torch2coreml.py
index 6d6c2fa..9b11052 100644
--- a/python_coreml_stable_diffusion/torch2coreml.py
+++ b/python_coreml_stable_diffusion/torch2coreml.py
@@ -324,7 +324,7 @@ def convert_text_encoder(pipe, args):
 def modify_coremltools_torch_frontend_badbmm():
     """
     Modifies coremltools torch frontend for baddbmm to be robust to the `beta` argument being of non-float dtype:
-    e.g. https://github.com/huggingface/diffusers/blob/v0.8.1/src/diffusers/models/attention.py#L315 
+    e.g. https://github.com/huggingface/diffusers/blob/v0.8.1/src/diffusers/models/attention.py#L315
     """
     from coremltools.converters.mil import register_torch_op
     from coremltools.converters.mil.mil import Builder as mb
@@ -387,8 +387,8 @@ def convert_vae_decoder(pipe, args):
     z_shape = (
         1,  # B
         pipe.vae.latent_channels,  # C
-        args.latent_h or pipe.unet.config.sample_size,  # H
-        args.latent_w or pipe.unet.config.sample_size,  # w
+        96,  # H
+        64,  # w
     )

     sample_vae_decoder_inputs = {
@@ -484,8 +484,8 @@ def convert_unet(pipe, args):
         sample_shape = (
             batch_size,                    # B
             pipe.unet.config.in_channels,  # C
-            pipe.unet.config.sample_size,  # H
-            pipe.unet.config.sample_size,  # W
+            96,  # H
+            64,  # W
         )

         if not hasattr(pipe, "text_encoder"):
@@ -622,8 +622,8 @@ def convert_safety_checker(pipe, args):

     sample_image = np.random.randn(
         1,  # B
-        pipe.vae.config.sample_size,  # H
-        pipe.vae.config.sample_size,  # w
+        96,  # H
+        64,  # w
         3  # C
     ).astype(np.float32)

@@ -757,7 +757,7 @@ def convert_safety_checker(pipe, args):
     coreml_safety_checker.input_description["clip_input"] = \
         "The normalized image input tensor resized to (224x224) in channels-first (BCHW) format"
     coreml_safety_checker.input_description["images"] = \
-        f"Output of the vae_decoder ({pipe.vae.config.sample_size}x{pipe.vae.config.sample_size}) in channels-last (BHWC) format"
+        f"Output of the vae_decoder (96x64) in channels-last (BHWC) format"
     coreml_safety_checker.input_description["adjustment"] = \
         "Bias added to the concept scores to trade off increased recall for reduce precision in the safety checker classifier"

@@ -847,19 +847,19 @@ def parser_spec():
     parser.add_argument("--compute-unit",
                         choices=tuple(cu
                                       for cu in ct.ComputeUnit._member_names_),
-                        default="ALL")
+                        default="CPU_AND_GPU")

     parser.add_argument(
         "--latent-h",
         type=int,
-        default=None,
+        default=96,
         help=
         "The spatial resolution (number of rows) of the latent space. `Defaults to pipe.unet.config.sample_size`",
     )
     parser.add_argument(
         "--latent-w",
         type=int,
-        default=None,
+        default=64,
         help=
         "The spatial resolution (number of cols) of the latent space. `Defaults to pipe.unet.config.sample_size`",
     )