RuntimeError: The size of tensor a (64) must match the size of tensor b (96) at non-singleton dimension 3
Test runs:
without --unet-support-controlnet , @ 512x512 -- OK
without --unet-support-controlnet , @ 768x768 -- OK
with --unet-support-controlnet , @ 512x512 -- OK
with --unet-support-controlnet , @ 768x768 -- FAIL
Appears to complete:
Stable_Diffusion_version_diffusers_vae_decoder.mlpackage
Stable_Diffusion_version_diffusers_vae_encoder.mlpackage
Errors when starting:
Stable_Diffusion_version_diffusers_control-unet.mlpackage
INFO:__main__:Initializing StableDiffusionPipeline with ./diffusers..
/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/transformers/models/clip/feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
warnings.warn(
text_config_dict is provided which will be used to initialize CLIPTextConfig. The value text_config["id2label"] will be overriden.
INFO:__main__:Done.
INFO:__main__:Attention implementation in effect: AttentionImplementations.ORIGINAL
INFO:__main__:Converting vae_decoder
/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/diffusers/models/resnet.py:127: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/diffusers/models/resnet.py:140: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
INFO:__main__:Converting vae_decoder to CoreML..
Converting PyTorch Frontend ==> MIL Ops: 0%| | 0/426 [00:00<?, ? ops/s]WARNING:__main__:Casted the beta(value=0.0) argument of baddbmm op from int32 to float32 dtype for conversion!
Converting PyTorch Frontend ==> MIL Ops: 100%|▉| 425/426 [00:00<00:00, 2270.48 o
Running MIL frontend_pytorch pipeline: 100%|█| 5/5 [00:00<00:00, 330.30 passes/s
Running MIL default pipeline: 100%|████████| 57/57 [00:03<00:00, 17.60 passes/s]
Running MIL backend_mlprogram pipeline: 100%|█| 10/10 [00:00<00:00, 671.22 passe
INFO:__main__:Saved vae_decoder model to ./SD15-Original-768x768/Stable_Diffusion_version_._diffusers_vae_decoder.mlpackage
INFO:__main__:Saved vae_decoder into ./SD15-Original-768x768/Stable_Diffusion_version_._diffusers_vae_decoder.mlpackage
INFO:__main__:Converted vae_decoder
INFO:__main__:Converting vae_encoder
/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/diffusers/models/resnet.py:200: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/diffusers/models/resnet.py:205: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
INFO:__main__:Converting vae_encoder to CoreML..
Converting PyTorch Frontend ==> MIL Ops: 0%| | 0/354 [00:00<?, ? ops/s]WARNING:__main__:Casted the beta(value=0.0) argument of baddbmm op from int32 to float32 dtype for conversion!
Converting PyTorch Frontend ==> MIL Ops: 100%|▉| 353/354 [00:00<00:00, 2195.42 o
Running MIL frontend_pytorch pipeline: 100%|█| 5/5 [00:00<00:00, 455.95 passes/s
Running MIL default pipeline: 100%|████████| 57/57 [00:02<00:00, 28.16 passes/s]
Running MIL backend_mlprogram pipeline: 100%|█| 10/10 [00:00<00:00, 869.65 passe
INFO:__main__:Saved vae_encoder model to ./SD15-Original-768x768/Stable_Diffusion_version_._diffusers_vae_encoder.mlpackage
INFO:__main__:Saved vae_encoder into ./SD15-Original-768x768/Stable_Diffusion_version_._diffusers_vae_encoder.mlpackage
INFO:__main__:Converted vae_encoder
INFO:__main__:Converting unet
INFO:__main__:Sample UNet inputs spec: {'sample': (torch.Size([2, 4, 96, 96]), torch.float32), 'timestep': (torch.Size([2]), torch.float32), 'encoder_hidden_states': (torch.Size([2, 768, 1, 77]), torch.float32), 'additional_residual_0': (torch.Size([2, 320, 64, 64]), torch.float32), 'additional_residual_1': (torch.Size([2, 320, 64, 64]), torch.float32), 'additional_residual_2': (torch.Size([2, 320, 64, 64]), torch.float32), 'additional_residual_3': (torch.Size([2, 320, 32, 32]), torch.float32), 'additional_residual_4': (torch.Size([2, 640, 32, 32]), torch.float32), 'additional_residual_5': (torch.Size([2, 640, 32, 32]), torch.float32), 'additional_residual_6': (torch.Size([2, 640, 16, 16]), torch.float32), 'additional_residual_7': (torch.Size([2, 1280, 16, 16]), torch.float32), 'additional_residual_8': (torch.Size([2, 1280, 16, 16]), torch.float32), 'additional_residual_9': (torch.Size([2, 1280, 8, 8]), torch.float32), 'additional_residual_10': (torch.Size([2, 1280, 8, 8]), torch.float32), 'additional_residual_11': (torch.Size([2, 1280, 8, 8]), torch.float32), 'additional_residual_12': (torch.Size([2, 1280, 8, 8]), torch.float32)}
INFO:__main__:JIT tracing..
/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/python_coreml_stable_diffusion/layer_norm.py:61: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert inputs.size(1) == self.num_channels
Traceback (most recent call last):
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/python_coreml_stable_diffusion/torch2coreml.py", line 1282, in <module>
main(args)
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/python_coreml_stable_diffusion/torch2coreml.py", line 1147, in main
convert_unet(pipe, args)
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/python_coreml_stable_diffusion/torch2coreml.py", line 688, in convert_unet
reference_unet = torch.jit.trace(reference_unet,
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/torch/jit/_trace.py", line 794, in trace
return trace_module(
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/torch/jit/_trace.py", line 1056, in trace_module
module._c._create_method_from_trace(
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1488, in _slow_forward
result = self.forward(*input, **kwargs)
File "/Users/jrittvo/miniconda3/envs/python_playground/lib/python3.10/site-packages/python_coreml_stable_diffusion/unet.py", line 972, in forward
down_block_res_sample = down_block_res_sample + additional_residuals[i]
RuntimeError: The size of tensor a (96) must match the size of tensor b (64) at non-singleton dimension 3
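The failing line in unet.py adds each ControlNet residual to the corresponding UNet down-block activation elementwise, so the two tensors must agree in every dimension. With `--latent-h 96 --latent-w 96` the traced UNet activations are 96x96-based, while the `additional_residual_*` inputs in the spec above are still built for the 64x64 default. A minimal sketch of the mismatch, using NumPy arrays as stand-ins for the torch tensors (shapes copied from the log):

```python
import numpy as np

# UNet down-block activation as traced with --latent-h 96 --latent-w 96
down_block_res_sample = np.zeros((2, 320, 96, 96), dtype=np.float32)

# ControlNet residual still shaped for the 512x512 default (latent 64x64),
# as shown in the "Sample UNet inputs spec" above
additional_residual = np.zeros((2, 320, 64, 64), dtype=np.float32)

try:
    # mirrors `down_block_res_sample + additional_residuals[i]` in unet.py
    down_block_res_sample + additional_residual
except ValueError as exc:
    # NumPy refuses for the same reason torch does: 96 != 64 and neither
    # is 1, so the trailing dimensions cannot be broadcast together
    print("mismatch:", exc)
```

Note that 512x512 works by accident: there the latent size is 64x64, so the hardcoded residual shapes happen to line up.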
Thanks for the report @jrittvo! I pushed a fix for this issue with ControlNet and custom latent dimensions. Feel free to open a new issue if the problem persists for you or a related one appears.
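For anyone pinned to an older checkout, the gist of the fix is that the control-unet's `additional_residual_*` inputs must be sized from the requested latent dimensions rather than the 512x512 default. A rough sketch of the expected shapes; the channel/downsample layout below is inferred from the sample-inputs spec printed in the log, not taken from the repository:

```python
def controlnet_residual_shapes(latent_h, latent_w, batch=2):
    """Sketch of the 13 SD 1.5 down-block residual shapes for a given
    latent size (channel counts and downsample factors inferred from the
    "Sample UNet inputs spec" in the log above)."""
    # (channels, downsample factor) per additional_residual input
    layout = [(320, 1), (320, 1), (320, 1), (320, 2),
              (640, 2), (640, 2), (640, 4),
              (1280, 4), (1280, 4), (1280, 8),
              (1280, 8), (1280, 8), (1280, 8)]
    return [(batch, ch, latent_h // f, latent_w // f) for ch, f in layout]

# 512x512 -> latent 64: matches the hardcoded spec in the log
print(controlnet_residual_shapes(64, 64)[0])   # (2, 320, 64, 64)
# 768x768 -> latent 96: what the fixed converter should feed the UNet
print(controlnet_residual_shapes(96, 96)[0])   # (2, 320, 96, 96)
```

With `--latent-h 96 --latent-w 96` the residuals should run 96/96/96/48, 48/48/24, 24/24/12, 12/12/12, matching the 96x96 `sample` input instead of the 64x64 shapes the trace received.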
I am trying to convert the basic Stable Diffusion v1.5 model downloaded from https://huggingface.co/runwayml/stable-diffusion-v1-5 from the diffusers format to Core ML (ORIGINAL attention, 768x768) for use with ControlNet.
My command line is:
python -m python_coreml_stable_diffusion.torch2coreml --convert-unet --convert-text-encoder --convert-vae-encoder --convert-vae-decoder --unet-support-controlnet --model-version "./diffusers" --bundle-resources-for-swift-cli --attention-implementation ORIGINAL --latent-h 96 --latent-w 96 --compute-unit CPU_AND_GPU -o "./SD15-Original-768x768"
Pipeline 1 uses: coremltools 6.2, diffusers 0.14.0, python 3.8
Pipeline 2 uses: coremltools 6.3, diffusers 0.15.1, python 3.10
Behavior is identical in both pipelines.