Closed harubaru closed 1 year ago
Hey Haru. Great work on the variation generation!
There seems to be an issue with variable-size images because of the CLIP encoding step. You can fix that by replicating the preprocessing CLIP normally applies (center-crop to the shortest side, then resize to CLIP's input resolution). A loop would look like this:
```python
import io

import torch
from PIL import Image
import torchvision.transforms as T

for name, file_info in file_input.value.items():
    image_pil = Image.open(io.BytesIO(file_info['content']))
    min_side = min(image_pil.size)
    transforms = T.Compose([
        T.ToTensor(),
        # square-crop to the shortest side, then resize to CLIP's input size
        T.CenterCrop(min_side),
        T.Resize(clip.image_size),
    ])
    image_tensor = transforms(image_pil).unsqueeze(0).to(device)
    unbatched_image_embed, _ = clip.embed_image(image_tensor)
    image_embed = torch.zeros(text_rep, unbatched_image_embed.shape[-1])
```
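The crop geometry that `T.CenterCrop(min_side)` relies on can be sanity-checked without any imaging libraries. This is just a sketch of the arithmetic, not part of the notebook; `center_crop_box` is a hypothetical helper:

```python
def center_crop_box(width, height):
    # Compute the (left, top, right, bottom) box of a square center crop
    # to the image's minimum side, mirroring T.CenterCrop(min(width, height)).
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

print(center_crop_box(640, 480))  # -> (80, 0, 560, 480): 80 px trimmed per side
print(center_crop_box(480, 640))  # -> (0, 80, 480, 560): portrait case
```

Because the crop is always square, the subsequent `T.Resize(clip.image_size)` produces a fixed-size tensor regardless of the uploaded image's aspect ratio.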
Also, the way the image grouping works, the number of images must match the number of prompts, so I would ask you to add something like this to enforce that:
```python
num_images = len(file_input.value.items())
num_prompts = len(prompts)
assert num_images == num_prompts, "Each uploaded image must have an associated prompt"
```
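With that invariant enforced, pairing each upload with its prompt is a simple `zip`. A minimal sketch, assuming `file_input.value` is an ordered dict of uploads (the names and prompts below are hypothetical stand-ins):

```python
prompts = ["a painting of a cat", "a photo of a dog"]          # hypothetical
uploads = {"cat.png": b"\x89PNG...", "dog.png": b"\x89PNG..."}  # stand-in for file_input.value

assert len(uploads) == len(prompts), "Each uploaded image must have an associated prompt"

# dicts preserve insertion order (Python 3.7+), so upload order matches prompt order
pairs = list(zip(uploads.items(), prompts))
print([(name, prompt) for (name, _data), prompt in pairs])
# -> [('cat.png', 'a painting of a cat'), ('dog.png', 'a photo of a dog')]
```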
Finally, could you make sure to clear the notebook's output before committing? I forgot to do that the first time, and it makes the diffs blow up so it's difficult to see what has actually changed.
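The usual way to do this is `jupyter nbconvert --clear-output --inplace <notebook>`, but if nbconvert isn't handy the same effect can be had with a few lines of stdlib Python. A sketch, assuming the nbformat-4 JSON layout; `clear_outputs` is a hypothetical helper:

```python
import json

def clear_outputs(nb_json: str) -> str:
    # Remove outputs and execution counts from every code cell
    # of a v4 notebook, leaving cell sources untouched.
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb)

# tiny notebook with one executed code cell
nb = json.dumps({"nbformat": 4, "cells": [
    {"cell_type": "code", "source": "1+1",
     "outputs": [{"output_type": "execute_result"}], "execution_count": 3},
]})
cleared = json.loads(clear_outputs(nb))
print(cleared["cells"][0]["outputs"])  # -> []
```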
Noted! Thanks for the feedback, I'm going to see if I can get to that soon.
This adds support for the notebook to generate variations of images based on image input. It's a little bit hacky right now as I am unfamiliar with using Colab and Notebooks in general, but this is a start! 😅