Great work, couple of questions!

Luke2642 commented 1 year ago

Thanks for this repo, it's much easier to follow than the original Google null-text inversion code in prompt-to-prompt! I haven't quite got yours working yet, though.

Do you have any plans to make a free-tier Colab notebook, not just a Jupyter one, for beginners like me? I've got nothing but black images out when trying to convert the original to fp16 for free Colab's 16 GB T4. I've got a 3090 locally, but not everyone does.
Do you think a DDIM inversion could be saved out to a .pt file to be used like a text embedding in UIs like automatic1111's, or just as a kind of lossy image compression format? I know this has been done with the latents, but could this be even smaller? https://pub.towardsai.net/stable-diffusion-based-image-compresssion-6f1f0a399202
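For a rough feel of the sizes involved in the second question, here is a back-of-the-envelope sketch. It assumes Stable Diffusion v1 shapes for a 512x512 image (a 4x64x64 latent and a 77x768 CLIP text embedding), fp16 storage, and 50 DDIM steps with one null-text embedding per step. The step count and the variable names are my assumptions, not something taken from this repo.

```python
# Back-of-the-envelope storage sizes for one 512x512 image in Stable Diffusion v1.
# Assumptions: fp16 storage, 50 DDIM steps, one null-text embedding per step.

BYTES_FP16 = 2
NUM_STEPS = 50  # assumed DDIM step count


def n_values(*shape):
    """Number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# Raw RGB image, one byte per channel.
image_bytes = n_values(512, 512, 3)

# VAE latent; the DDIM-inverted noise has the same 4 x 64 x 64 shape.
latent_bytes = n_values(4, 64, 64) * BYTES_FP16

# Null-text embeddings: one 77 x 768 text embedding per DDIM step.
null_text_bytes = n_values(NUM_STEPS, 77, 768) * BYTES_FP16

for name, size in [
    ("raw image", image_bytes),
    ("latent / inverted noise", latent_bytes),
    ("null-text embeddings", null_text_bytes),
]:
    print(f"{name:>24}: {size / 1024:8.1f} KiB")
```

Under these assumptions the inverted noise is the same size as the latent rather than smaller, and the per-step null-text embeddings are much larger than either, so the saving would have to come from something else (fewer steps, or sharing one embedding across steps).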

For requirements on free Colab: I've been using old versions to get it to work with the original. This combo works on free Colab:

!pip install -U --pre triton torchinfo xformers==0.0.16rc425 diffusers==0.7.2 transformers==4.22.2 accelerate==0.12.0 ftfy

And this combo works too:

!pip install --quiet diffusers==0.8.0
!pip install --quiet https://github.com/brian6091/xformers-wheels/releases/download/0.0.15.dev0%2B4c06c79/xformers-0.0.15.dev0+4c06c79.d20221205-cp38-cp38-linux_x86_64.whl
!pip install --quiet --upgrade transformers scipy mediapy accelerate ftfy spacy einops
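Since these combos are sensitive to exact versions, a quick sanity check after installing can save some confusion. This is only a sketch using the standard library. The `PINNED` dict and the `check_pins` helper are hypothetical names of mine, and the pins are copied from the first install line above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the first install line above; adjust to the combo you use.
PINNED = {
    "diffusers": "0.7.2",
    "transformers": "4.22.2",
    "accelerate": "0.12.0",
}


def check_pins(pins):
    """Return {package: (wanted, installed)} for every pin that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Report anything that differs from the pinned combo.
for name, (wanted, installed) in check_pins(PINNED).items():
    print(f"{name}: wanted {wanted}, found {installed}")
```

An empty report means the runtime matches the pins. Anything printed is worth fixing before debugging the notebook itself.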

cloneofsimo commented 1 year ago

Oh, I never realized this repo had gotten even a little attention. Thank you for reaching out. I'll be experimenting on this repo much more often from now on, so please expect more from me soon!

cloneofsimo commented 1 year ago

The problem is that I don't actually use the A1111 repo that much, so I have no idea what you mean. But as for the Colab notebook, I'll make one as soon as I can!

Luke2642 commented 1 year ago

Fantastic, that’s great news, and thanks so much for the reply!

I'm still learning the basics, but I'm excited; I think it'll be so useful!

VQ-VAE reconstruction is relatively trivial and not that much use, but it could be good for image compression, as in the article I linked before.
CLIP embedding is key to both image variations and image mixing, but both require a fine-tuned model to generate from a CLIP embedding. It'd be great if future diffusion models were all trained this way, but that's not something we can influence much.
As I understand it, the precision of a textual inversion is limited by the second forward pass of classifier-free guidance. And the more vectors are used, the lower the editability: more vectors seem only to constrain the diffusion process more strongly to “paint” the same image across all seeds, rather than creating a semantic description as image variations / mixing do.
Null-text inversion does seem like a really interesting approach: it solves for one seed, and is possibly semantic enough to be useful across many seeds and editable. But it'll need implementing in the major UIs to be creatively useful outside developer circles!
Another interesting approach was posted recently, closer to the “holy grail” idea of being able to turn any image into a meaningful description in human language that can then be turned back into the image:

https://arxiv.org/abs/2302.03668
https://github.com/YuxinWenRick/hard-prompts-made-easy
https://huggingface.co/spaces/tomg-group-umd/pez-dispenser
https://colab.research.google.com/drive/1VSFps4siwASXDwhK_o29dKA9COvTnG8A?usp=sharing
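Going back to the classifier-free guidance and inversion points above, here is a toy sketch of how I understand the pieces fit together. It is not this repo's implementation. The "image" is a single float, `toy_eps` is a made-up stand-in for the UNet, and the `ALPHAS` schedule is hypothetical. It only shows the guidance formula and the deterministic DDIM update, which can be run toward noise (inversion) or back toward the image (sampling).

```python
import math


def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move `scale` times the gap toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def ddim_move(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels.

    `alpha` is the cumulative alpha-bar. The same formula is used for
    sampling (alpha_to > alpha_from) and for inversion (alpha_to < alpha_from).
    """
    x0_pred = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * eps


def walk(x, alphas, eps_fn):
    """Step x through consecutive noise levels, predicting eps at each start."""
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_move(x, eps_fn(x, a_from), a_from, a_to)
    return x


def toy_eps(x, cond):
    """Hypothetical stand-in for the UNet noise prediction. NOT a real model."""
    return 0.3 * x + (0.5 if cond else 0.0)


def guided(scale):
    """Build an eps function that applies guidance to the toy model."""
    return lambda x, alpha: cfg(toy_eps(x, False), toy_eps(x, True), scale)


# Hypothetical schedule: nearly clean (0.999) down to mostly noise (about 0.05).
ALPHAS = [0.999 - 0.019 * i for i in range(51)]

x0 = 1.0
for scale in (1.0, 7.5):
    noisy = walk(x0, ALPHAS, guided(scale))         # inversion: image -> noise
    recon = walk(noisy, ALPHAS[::-1], guided(scale))  # sampling: noise -> image
    print(f"guidance {scale}: round-trip error {abs(recon - x0):.6f}")
```

The round trip is only approximate because inversion predicts the noise at the cleaner level while sampling predicts it at the noisier one. As I understand the paper, null-text inversion then tunes the unconditional embedding at each step so that guided sampling retraces the inverted trajectory; that optimisation is not shown here.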

Anyway, long ramble, keep up the good work!
