apple / ml-stable-diffusion

Stable Diffusion with Core ML on Apple Silicon

When running Image to Image, the Error.sampleInputShapeNotCorrect is thrown if the model output is 512x768 and the starting image is also 512x768. #143

Closed. jo32 closed this issue 1 year ago.

jo32 commented 1 year ago

I attempted to use this model for image-to-image: https://huggingface.co/coreml/coreml-Grapefruit/blob/main/original/512x768/grapefruit41_original_512x768.zip

This model worked perfectly fine in text-to-image mode. However, when I passed a 512x768 image as the startingImage, I received an error from this line: https://github.com/apple/ml-stable-diffusion/blob/2c4e9de73c9e723de264356f9563706ea9104212/swift/StableDiffusion/pipeline/Encoder.swift#L89

It seems that the Encoder expects an input of shape [1, 3, 768, 512], but the image tensor has shape [1, 3, 512, 768].

Model description of VAEEncoder.mlmodelc:

[
  {
    "shortDescription" : "Stable Diffusion generates images conditioned on text and\/or other images as input through the diffusion process. Please refer to https:\/\/arxiv.org\/abs\/2112.10752 for details.",
    "metadataOutputVersion" : "3.0",
    "outputSchema" : [
      {
        "hasShapeFlexibility" : "0",
        "isOptional" : "0",
        "dataType" : "Float32",
        "formattedType" : "MultiArray (Float32)",
        "shortDescription" : "The latent embeddings from the unet model from the input image.",
        "shape" : "[]",
        "name" : "latent_dist",
        "type" : "MultiArray"
      }
    ],
    "version" : ".\/diffusers",
    "modelParameters" : [

    ],
    "author" : "Please refer to the Model Card available at huggingface.co\/.\/diffusers",
    "specificationVersion" : 7,
    "storagePrecision" : "Float16",
    "license" : "OpenRAIL (https:\/\/huggingface.co\/spaces\/CompVis\/stable-diffusion-license)",
    "mlProgramOperationTypeHistogram" : {
      "Transpose" : 7,
      "Ios16.exp" : 1,
      "Ios16.reduceMean" : 44,
      "Ios16.softmax" : 1,
      "Split" : 1,
      "Ios16.linear" : 4,
      "Ios16.add" : 35,
      "Ios16.realDiv" : 22,
      "Ios16.square" : 22,
      "Pad" : 3,
      "Ios16.sub" : 22,
      "Ios16.cast" : 1,
      "Ios16.clip" : 1,
      "Ios16.conv" : 28,
      "Ios16.matmul" : 2,
      "Ios16.reshape" : 54,
      "Ios16.batchNorm" : 22,
      "Ios16.silu" : 21,
      "Ios16.sqrt" : 22,
      "Ios16.mul" : 6
    },
    "computePrecision" : "Mixed (Float32, Float16, Int32)",
    "isUpdatable" : "0",
    "availability" : {
      "macOS" : "13.0",
      "tvOS" : "16.0",
      "watchOS" : "9.0",
      "iOS" : "16.0",
      "macCatalyst" : "16.0"
    },
    "modelType" : {
      "name" : "MLModelType_mlProgram"
    },
    "inputSchema" : [
      {
        "hasShapeFlexibility" : "0",
        "isOptional" : "0",
        "dataType" : "Float16",
        "formattedType" : "MultiArray (Float16 1 × 3 × 768 × 512)",
        "shortDescription" : "An image of the correct size to create the latent space with, image2image and in-painting.",
        "shape" : "[1, 3, 768, 512]",
        "name" : "sample",
        "type" : "MultiArray"
      },
      {
        "hasShapeFlexibility" : "0",
        "isOptional" : "0",
        "dataType" : "Float16",
        "formattedType" : "MultiArray (Float16 1 × 4 × 96 × 64)",
        "shortDescription" : "Latent noise for `DiagonalGaussianDistribution` operation.",
        "shape" : "[1, 4, 96, 64]",
        "name" : "diagonal_noise",
        "type" : "MultiArray"
      },
      {
        "hasShapeFlexibility" : "0",
        "isOptional" : "0",
        "dataType" : "Float16",
        "formattedType" : "MultiArray (Float16 1 × 4 × 96 × 64)",
        "shortDescription" : "Latent noise for use with strength parameter of image2image",
        "shape" : "[1, 4, 96, 64]",
        "name" : "noise",
        "type" : "MultiArray"
      },
      {
        "hasShapeFlexibility" : "0",
        "isOptional" : "0",
        "dataType" : "Float16",
        "formattedType" : "MultiArray (Float16 1 × 1)",
        "shortDescription" : "Precalculated `sqrt_alphas_cumprod` value based on strength and the current schedular's alphasCumprod values",
        "shape" : "[1, 1]",
        "name" : "sqrt_alphas_cumprod",
        "type" : "MultiArray"
      },
      {
        "hasShapeFlexibility" : "0",
        "isOptional" : "0",
        "dataType" : "Float16",
        "formattedType" : "MultiArray (Float16 1 × 1)",
        "shortDescription" : "Precalculated `sqrt_one_minus_alphas_cumprod` value based on strength and the current schedular's alphasCumprod values",
        "shape" : "[1, 1]",
        "name" : "sqrt_one_minus_alphas_cumprod",
        "type" : "MultiArray"
      }
    ],
    "userDefinedMetadata" : {
      "com.github.apple.coremltools.version" : "6.2",
      "com.github.apple.coremltools.source" : "torch==1.13.1"
    },
    "generatedClassName" : "Stable_Diffusion_version___diffusers_vae_encoder",
    "method" : "predict"
  }
]

"formattedType" : "MultiArray (Float16 1 × 3 × 768 × 512)",

jiangdi0924 commented 1 year ago

Same issue. I remember the img2img feature was contributed by @littleowl; could you look into this problem? ❤️ I have tested several models, for example: with txt2img the size is 512×768, but in img2img mode I need to submit a starting image with a size of 768×512. This is likely a bug.

littleowl commented 1 year ago

This project does not support flexible shapes, unfortunately. Currently you have to create separate models for each size you want. However, the weights inside the models are the same; it appears only the architecture and the metadata differ. When I tried to implement flexible shapes I ran into problems, which I filed issues for, and I have some hypotheses about the cause.

Anyway, this does work if you create the VAE (and probably the UNet) for the specific sizes.

littleowl commented 1 year ago

It also looks like there may be an orientation problem with your input image. I'd recommend ensuring the images are in portrait orientation. If it is an image orientation issue, that could probably be solved in the pre-processing.
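
To illustrate that suggestion only (this helper is hypothetical and not part of the pipeline), a minimal CoreGraphics sketch that rotates a landscape CGImage into portrait orientation before it is used as the startingImage:

import CoreGraphics

// Rotate a landscape CGImage by 90 degrees so it becomes portrait.
func rotatedToPortrait(_ image: CGImage) -> CGImage? {
    guard image.width > image.height else { return image } // already portrait
    let newWidth = image.height
    let newHeight = image.width
    guard let context = CGContext(
        data: nil,
        width: newWidth,
        height: newHeight,
        bitsPerComponent: 8,
        bytesPerRow: 0,
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
    ) else { return nil }

    // Rotate around the centre of the new canvas, then draw the original image.
    context.translateBy(x: CGFloat(newWidth) / 2, y: CGFloat(newHeight) / 2)
    context.rotate(by: .pi / 2)
    context.draw(image, in: CGRect(x: -CGFloat(image.width) / 2,
                                   y: -CGFloat(image.height) / 2,
                                   width: CGFloat(image.width),
                                   height: CGFloat(image.height)))
    return context.makeImage()
}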

jo32 commented 1 year ago

It also looks like there may be an orientation problem with your input image. I'd recommend ensuring the images are in portrait orientation. If it is an image orientation issue, that could probably be solved in the pre-processing.

@littleowl If possible, could you please provide a working example that uses a 512x768 image as input? I have already tried rotating the image, but still no success (the output is weird). Thank you in advance.

jrittvo commented 1 year ago

Same issue. I remember the img2img feature was contributed by @littleowl; could you look into this problem? ❤️ I have tested several models, for example: with txt2img the size is 512×768, but in img2img mode I need to submit a starting image with a size of 768×512. This is likely a bug.

Are you saying that the height and width attributes are "transposed" depending on the pipeline selected? For example, a model built to 512x768 that produces a 512x768 image with text2image actually wants a 768x512 input image, and will then produce a 768x512 output, when using image2image instead of text2image?

jiangdi0924 commented 1 year ago

Are you saying that the height and width attributes are "transposed" depending on the pipeline selected? For example, a model built to 512x768 that produces a 512x768 image with text2image actually wants a 768x512 input image, and will then produce a 768x512 output, when using image2image instead of text2image?

yes

jrittvo commented 1 year ago

Interesting. I use an app, Mochi Diffusion (https://github.com/godly-devotion/MochiDiffusion), where the developer was not able to get image2image to work at 512x768 or 768x512, so he built in a check that blocks anything other than 512x512 for image2image. I'll link him to this part of the thread, and perhaps he can build a workaround into his app until the underlying issue is resolved. Thanks for the lead.

littleowl commented 1 year ago

Are you all creating a new encoder model for each size you wish to accept? Or are you just using a non-512x512 image with a model built for 512x512? This project does not yet support dynamic image sizes.

I believe PyTorch orders these tensors as (batch, channels, height, width), with height before width, so maybe some of the confusion comes from there?
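
A minimal sketch of that layout, using the 512x768 model's dimensions (illustration only, not code from this repo):

import CoreML

// Layout sketch: the encoder's "sample" input for the 512x768 model is
// (batch, channels, height, width), so height (768) comes before width (512).
let height = 768
let width = 512
let sample = MLShapedArray<Float16>(repeating: 0, shape: [1, 3, height, width])

print(sample.shape) // [1, 3, 768, 512]
// Index order is [batch, channel, row (y), column (x)]; feeding an array
// built as [1, 3, 512, 768] (width and height swapped) is what triggers
// Error.sampleInputShapeNotCorrect in Encoder.swift.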

Previously (I'm not sure of the current state), changing the latent width and height options when running the Core ML conversion tools was broken or insufficient in my experience, so when I wanted to create models that render different sizes I would hard-code some values in the Python script. By that process I have been able to get different sizes to work with no problem at all. I don't imagine much has changed, so I'm curious how you all are obtaining a pipeline with a non-512x512 size, because, as far as I am aware, the UNet also needs a specialized model to run a different latent-space size.

I've previously encountered issues (which I reported) when trying to implement dynamic shapes, and I was going to seek help from the coremltools repo. Implementing dynamic shapes would be the ideal solution; the README in this repo has suggested from the beginning that such a feature could be added. It has been many months since then, though, so it's worth trying again.

jrittvo commented 1 year ago

The Mochi Diffusion people had been trying with models built to render different image sizes, specifically 512x768 and 768x512. That is why I was intrigued by the possibility that height and width are being transposed somewhere. The developer had a pull request merged here that fixed the height and width option issues in the scripts I think you are referring to: https://github.com/apple/ml-stable-diffusion/pull/123