Same issue. I remember that the img2img feature was submitted by @littleowl; can you solve this problem? ❤️ I have tested several models. For example, for txt2img the size is 512×768; however, in img2img mode I need to submit a start image with a size of 768×512. This is likely a bug.
This project does not support flexible shapes, unfortunately. Currently, you have to create separate models for each size you want. However, the weights inside the models are the same; it appears only the architecture and the metadata differ. When I tried to implement flexible shapes I ran into problems, which I filed as issues along with my hypotheses.
Anyway, this does work if you create the VAE (and probably the UNET) for the specific sizes.
It also looks like there may be an orientation problem with your input image. I'd recommend ensuring the images are in portrait orientation. If it is an image orientation issue, then it could probably be solved in the pre-processing.
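For what it's worth, here is a minimal pre-processing sketch along those lines, assuming Pillow is available; the file names and the 512x768 portrait target size are only illustrative and should match whatever size your model was converted for:

```python
from PIL import Image, ImageOps

# Hypothetical paths and target size; adjust to your model.
TARGET_W, TARGET_H = 512, 768  # portrait: 512 wide, 768 tall

img = Image.open("start.png")
img = ImageOps.exif_transpose(img)   # apply any EXIF rotation metadata

if img.width > img.height:           # landscape -> rotate to portrait
    img = img.rotate(90, expand=True)

img = img.resize((TARGET_W, TARGET_H), Image.LANCZOS)
img.save("start_portrait.png")
```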
@littleowl If possible, could you please provide a working example using a 512x768 image as input? I have already tried rotating the image, but still no success (weird output). Thank you in advance.
Are you saying that the height and width attributes are "transposed" depending on the pipeline selected? For example, a model built at 512x768 that produces a 512x768 image with text2image actually wants a 768x512 input image, and will then produce a 768x512 output, when using image2image instead of text2image?
yes
Interesting. I use an app, Mochi Diffusion (https://github.com/godly-devotion/MochiDiffusion), where the developer was not able to get image2image working at 512x768 or 768x512, so he built in a check that blocks anything other than 512x512 from being used in image2image. I'll link him to this part of the thread, and perhaps he can build a workaround into his app until the underlying issue is resolved. Thanks for the lead.
Are you all creating a new encoder model for each size you wish to accept? Or are you just using an image that is not 512x512 with a model that is? This project does not yet support dynamic image sizes.
I believe PyTorch orders the shape as (batch, channels, height, width) rather than (batch, height, width, channels), so maybe there is some confusion there?
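To make that concrete with the sizes discussed in this thread: in PyTorch's channels-first layout the height axis comes before the width axis, so a "512x768" image in the usual width-by-height naming is a (1, 3, 768, 512) tensor.

```python
import torch

# A portrait image that is 512 wide and 768 tall, in PyTorch's
# channels-first (batch, channels, height, width) layout:
portrait = torch.zeros(1, 3, 768, 512)

# The same pixel count described as "768x512" (landscape, width x height):
landscape = torch.zeros(1, 3, 512, 768)

print(portrait.shape)   # torch.Size([1, 3, 768, 512])
print(landscape.shape)  # torch.Size([1, 3, 512, 768])
```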
Previously (I'm not sure of the state now), changing the latent width and height options when running the coremltools conversion was broken or insufficient in my experience, so if I wanted to create models that render different sizes, I would hard-code some values in the Python script. By that process I have been able to get different sizes to work with no problem at all. I don't imagine much has changed, so I'm curious how you all are obtaining this pipeline with a non-512x512 size, because, as far as I am aware, the UNET also needs a specialized model to run a different latent-space size.
I've previously encountered issues (which I reported) when trying to implement dynamic shapes, and I was going to seek help from the coremltools repo. Implementing dynamic shapes would be the ideal thing to do. From the beginning, the README in this repo has suggested that we could add such a feature. It has been many months since then, though, so it's worth trying again.
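For reference, a minimal sketch of how the latent size follows from the pixel size in Stable Diffusion v1.x (the VAE downsamples each spatial dimension by a factor of 8 and the UNet operates on 4 latent channels), which is why each target resolution currently needs its own fixed-shape UNet and VAE:

```python
# Stable Diffusion v1.x: the VAE downsamples each spatial dimension by 8,
# and the latent space has 4 channels.
VAE_FACTOR = 8
LATENT_CHANNELS = 4

def latent_shape(width: int, height: int, batch: int = 1):
    """Return the (batch, channels, height, width) latent shape for an image size."""
    assert width % VAE_FACTOR == 0 and height % VAE_FACTOR == 0
    return (batch, LATENT_CHANNELS, height // VAE_FACTOR, width // VAE_FACTOR)

print(latent_shape(512, 768))  # (1, 4, 96, 64)  -> portrait 512x768 model
print(latent_shape(768, 512))  # (1, 4, 64, 96)  -> landscape 768x512 model
print(latent_shape(512, 512))  # (1, 4, 64, 64)  -> default 512x512 model
```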
The Mochi Diffusion people had been trying with models built to render different image sizes, specifically 512x768 and 768x512. That is why I was intrigued by the possibility that h and w were being transposed somewhere. The developer had a pull merged here that fixed the h and w option issues in the scripts that I think you are referring to: https://github.com/apple/ml-stable-diffusion/pull/123
I attempted to use this model for image-to-image: https://huggingface.co/coreml/coreml-Grapefruit/blob/main/original/512x768/grapefruit41_original_512x768.zip
This model worked perfectly fine in text-to-image mode. However, when I passed a 512x768 image as the startingImage, I received an error from this line: https://github.com/apple/ml-stable-diffusion/blob/2c4e9de73c9e723de264356f9563706ea9104212/swift/StableDiffusion/pipeline/Encoder.swift#L89
It seems that the input of the Encoder is expected to be [1, 3, 768, 512], but the image's shape is [1, 3, 512, 768].
model description of VAEEncoder.modelc:
"formattedType" : "MultiArray (Float16 1 × 3 × 768 × 512)",