AUTOMATIC1111 / stable-diffusion-webui

[Feature Request]: Use image as conditioning #8028

Open · ljleb opened this issue 1 year ago

ljleb commented 1 year ago

Is there an existing issue for this?

What would your feature do?

As CLIP models map both images and text into a common latent space, I was wondering: what if, instead of encoding text for conditioning during sampling, we used an image?

I don't know much about machine learning and transformers, but I tried asking chatGPT to see if this would be possible. Based on what it replied, I'm now certain that either 1) it doesn't know, or 2) I don't know how to ask.

I know that SD checkpoints only include a text encoder, and no image encoder. I'm still curious: are there any models that could be used to map an image to an ok-ish CLIP space (i.e. something that resembles the text->CLIP encoder of the SD 1.5 ckpt, for example)? How hard would it be to train a model to do this if no such model exists?

Proposed workflow

  1. Go to txt2img or img2img tab
  2. Click the Image embedding checkbox
  3. Witness the text prompt get replaced by a gradio image component (a rough sketch follows below)
     alt. The same process could apply independently to the negative prompt?
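
A rough gradio sketch of what steps 2 and 3 could look like (hypothetical component names, not actual webui code; assumes gradio 3.x with gr.update):

import gradio as gr

def toggle_prompt(use_image):
    # hide the text prompt and show an image input when the checkbox is ticked
    return gr.update(visible=not use_image), gr.update(visible=use_image)

with gr.Blocks() as demo:
    use_image = gr.Checkbox(label="Image embedding", value=False)
    prompt = gr.Textbox(label="Prompt", visible=True)
    prompt_image = gr.Image(label="Prompt image", type="pil", visible=False)
    use_image.change(toggle_prompt, inputs=use_image, outputs=[prompt, prompt_image])

demo.launch()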

Additional information

No response

ljleb commented 1 year ago

FWIW, this is what chatGPT returned:

import torch
import numpy as np
from PIL import Image
from torchvision.transforms.functional import resize
from transformers import CLIPProcessor, CLIPModel

image_path = r"C:\path\to\image.png"

# load and resize the image, then convert it to a normalized float tensor
image = Image.open(image_path)
image = resize(image, size=(512, 512))
image = np.array(image)
image = [torch.tensor(image).permute(2, 0, 1).float() / 255.0]  # list of (C, H, W) tensors

# preprocess the image (and a dummy prompt) for CLIP ViT-L/14
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
text = ["image"]
input_dict = processor(text=text, images=image, return_tensors="pt")

# pass the image through the model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
outputs = model(**input_dict)

# print the shape of the vision tower's hidden states
print(outputs.vision_model_output.last_hidden_state.shape)

But this script returns torch.Size([1, 257, 1024]). The embedding dimension is 1024 instead of the 768 used by SD 1.x's text conditioning, so I'm not sure this is going to work.
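
To make the shape question concrete, here is a minimal sketch (assuming transformers and openai/clip-vit-large-patch14): the shared image/text space is the 768-dim projected space, but SD 1.x actually conditions on the text encoder's per-token hidden states rather than the pooled/projected vector, so a single projected image embedding still wouldn't slot in directly.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open(r"C:\path\to\image.png").convert("RGB")
inputs = processor(text=["image"], images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.vision_model_output.last_hidden_state.shape)  # (1, 257, 1024): vision tower hidden states
print(outputs.text_model_output.last_hidden_state.shape)    # (1, seq_len, 768): per-token text states; SD 1.x uses these, padded to 77 tokens
print(outputs.image_embeds.shape)                           # (1, 768): image features after the projection head
print(outputs.text_embeds.shape)                            # (1, 768): text features after the projection head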

missionfloyd commented 1 year ago

Isn't this basically what the Interrogate CLIP button does?

Ehplodor commented 1 year ago

This is exactly what has been done here: an SD fine-tune with the text CLIP encoder replaced by an image CLIP encoder: https://github.com/justinpinkney/stable-diffusion. See also: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/4595#discussioncomment-4459952

ljleb commented 1 year ago

I'm not sure this will ever get implemented so I'll try to make an extension in the meantime. No idea if I'll get it to work, we'll see!

ClashSAN commented 1 year ago

check this out: https://github.com/laksjdjf/pfg-webui

Ehplodor commented 1 year ago

check this out: https://github.com/laksjdjf/pfg-webui

Oh wow! Can't wait to try this. TY

ClashSAN commented 1 year ago

well, I think it only works with anime.

Ehplodor commented 1 year ago

Right, at the moment it seems capable of variations on the theme of "1girl" pictures sampled from the WD1.4 model (an anime fine-tune of SD2.1, if I'm right), and probably to some extent usable with other SD2.1 models (as stated in https://huggingface.co/furusu/PFG). So real images (or simply images not originating from SD2.1 models) might give worse outputs.

ljleb commented 1 year ago

@Ehplodor This is exactly what has been done here: an SD fine-tune with the text CLIP encoder replaced by an image CLIP encoder: https://github.com/justinpinkney/stable-diffusion. See also: #4595 (reply in thread)

Can you link to the commit or script in that repo that implements this? I looked a bit, but I can't find a call anywhere to a CLIP image model that encodes an arbitrary image into embeddings to condition the sampler with.

@ClashSAN check this out: https://github.com/laksjdjf/pfg-webui

This seems to be doing what I want, but I'm not sure it's doing it the way I was thinking of. I'm not sure why the custom PFG-WD model needs to be selected as if it was a checkpoint. Ideally, the pipeline just returns an embedding from an image that encodes its key features, and any arbitrary model can be conditioned on it.

As far as I understand, there does not seem to be a CLIP image embedder that matches the latent space of the text encoder used by stable diffusion. Does anyone know whether there exists a pretrained model that could be fine-tuned on a custom dataset of pairs (text encoder output, image generated by SD)? To fine-tune such a model, all we would need is a prompt database and GPUs to generate images from the prompts, and then train/fine-tune the model.

ljleb commented 1 year ago

Ah this is the line: https://github.com/justinpinkney/stable-diffusion/blob/4ac995b6f663b74dfe65400285e193d4167d259c/scripts/image_variations.py#L126

After looking at the model card for the model this script uses, it seems that this is an entirely different model from stable diffusion; it has been fine-tuned to accept image conditionings as input. I wonder if we can just use an add-difference merge: any_sd_1.x + image_sd_1.4 - sd_1.4?

Even if there's a way to merge other models with this model, I eventually wanted to be able to combine text and image conditionings using linear equations, so maybe the solution will be to look into finetuning a custom model.
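
For reference, a minimal sketch of what that add-difference merge would look like on raw state dicts (hypothetical file names; this assumes all three checkpoints share the same key layout, which may not fully hold here since the image-variations checkpoint replaces the text-conditioning stack):

import torch

# hypothetical paths; A = any SD 1.x fine-tune, B = image-conditioned SD, C = the common base (SD 1.4)
a = torch.load("any_sd_1x.ckpt", map_location="cpu")["state_dict"]
b = torch.load("image_sd_14.ckpt", map_location="cpu")["state_dict"]
c = torch.load("sd_14.ckpt", map_location="cpu")["state_dict"]

multiplier = 1.0
merged = {}
for key, theta_a in a.items():
    if key in b and key in c and theta_a.shape == b[key].shape == c[key].shape:
        # add difference: A + (B - C) * multiplier
        merged[key] = theta_a + (b[key] - c[key]) * multiplier
    else:
        # keys that only exist in one model (e.g. the swapped conditioning stack) are left as-is
        merged[key] = theta_a

torch.save({"state_dict": merged}, "merged.ckpt")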

Ehplodor commented 1 year ago

Can you link to the commit or script in that repo that implements this? I looked a bit, but I can't find a call anywhere to a CLIP image model that encodes an arbitrary image into embeddings to condition the sampler with.

The model itself is modified so as to accept an image as well as text as input, and fine-tuned to generate CLIP embeddings for both. The image CLIP embedding is thus produced at the same time as the text CLIP embedding and is not a separate call.

@ClashSAN check this out: https://github.com/laksjdjf/pfg-webui

This seems to be doing what I want, but I'm not sure it's doing it the way I was thinking of. I'm not sure why the custom PFG-WD model needs to be selected as if it was a checkpoint.

As far as I understand this recent repository, a first model (a sort of "CLIP interrogate" model, in my understanding) is used to extract an image feature embedding from one of its middle layers. Then a second model, the PFG-WD model, takes that embedding as input (along with the text) and was fine-tuned to use it as conditioning.
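
If that's right, the general mechanism would boil down to projecting the extracted image features to the width of the text conditioning and appending them as extra tokens for the UNet's cross-attention. A rough, hypothetical sketch of that idea (not the actual pfg-webui code):

import torch
import torch.nn as nn

# hypothetical dimensions for SD 1.x: the text context is (batch, 77, 768)
text_context = torch.randn(1, 77, 768)

# image features from some vision model / tagger, e.g. (batch, num_features, feature_dim)
image_features = torch.randn(1, 10, 1024)

# small learned projection, trained alongside (or instead of) a full fine-tune
project = nn.Linear(1024, 768)
image_tokens = project(image_features)  # (1, 10, 768)

# append the projected image tokens to the text context;
# the UNet's cross-attention then attends over 77 + 10 "tokens"
conditioning = torch.cat([text_context, image_tokens], dim=1)  # (1, 87, 768)
print(conditioning.shape)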

ljleb commented 1 year ago

I read the code again and it does indeed seem to be predicting conditionings from images. Thanks for sharing this!

PladsElsker commented 1 year ago

I'm wondering if training a model to translate directly from image embedding to text embedding would perform well. Generating data for this doesn't seem to be too hard to do.

For the dataset, you would need pairs of both types of embeddings. You could try using part of LAION's dataset, or make your own with websites like aibooru, parsing each image and its caption into their respective embedding representations.

I really don't know much about what kind of model would perform well, if there are better alternatives to this, or if it's possible to train a model like this efficiently on a low-end gpu. This is just an idea.
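
A minimal, hypothetical sketch of such a translator on pooled CLIP embeddings (a small MLP trained with MSE; note that SD 1.x conditioning is really the full 77x768 token sequence, so this is only the simplest version of the idea):

import torch
import torch.nn as nn

# hypothetical pre-computed pairs: CLIP image embeddings -> CLIP text embeddings (both 768-dim for ViT-L/14)
image_embeds = torch.randn(1000, 768)
text_embeds = torch.randn(1000, 768)

translator = nn.Sequential(
    nn.Linear(768, 2048),
    nn.GELU(),
    nn.Linear(2048, 768),
)

optimizer = torch.optim.AdamW(translator.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for epoch in range(10):
    for i in range(0, len(image_embeds), 64):
        pred = translator(image_embeds[i:i + 64])
        loss = loss_fn(pred, text_embeds[i:i + 64])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()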

ClashSAN commented 1 year ago

https://github.com/nagolinc/sd-image-variations-webui-plugin

Previously someone was working on getting it running; it uses the diffusers model https://huggingface.co/lambdalabs/sd-image-variations-diffusers

Maybe the model will be useful if its weights are merged into another model in add-difference mode.

This one is an improved version of the model: https://huggingface.co/lambdalabs/image-mixer

See the older post: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5505

ljleb commented 1 year ago

New paper on the subject: https://arxiv.org/pdf/2302.13848.pdf