AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

CrossAttentionControl, Dreamfields-3D and Dreambooth implementation #1280

Open Antoinevdlb opened 1 year ago

Antoinevdlb commented 1 year ago

Hey! I found these Stable Diffusion libraries that I would love to see integrated into this repo.

Curious if anyone else would also enjoy using them.

Cross Attention Control by bloc97

Cross Attention Control allows much finer control of the prompt by modifying the internal attention maps of the diffusion model during inference, without the need for the user to input a mask, and does so with minimal performance penalties (compared to CLIP guidance) and no additional training or fine-tuning of the diffusion model.
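For context, here is a rough toy sketch of the core idea, not bloc97's actual code: record the attention maps produced while denoising with the original prompt, then re-inject them when denoising with an edited prompt, so the spatial layout carries over while the content changes.

```python
import torch

class ControlledCrossAttention(torch.nn.Module):
    """Toy cross-attention layer that can record its attention map and
    re-inject it on a later pass with an edited prompt."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)
        self.saved_attn = None
        self.mode = "record"  # "record" on the original prompt, "inject" on the edit

    def forward(self, x, context):
        # x: latent tokens; context: text embeddings (same token count for
        # both prompts, e.g. CLIP's fixed 77 tokens)
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
        attn = attn.softmax(dim=-1)
        if self.mode == "record":
            self.saved_attn = attn.detach()
        else:
            # Reuse the recorded map: the edited prompt's values are mixed
            # using the original prompt's spatial attention.
            attn = self.saved_attn
        return attn @ v

# Demo: record on the "original prompt", inject on the "edited prompt".
layer = ControlledCrossAttention(dim=64)
x = torch.randn(1, 16, 64)      # latent tokens
ctx_a = torch.randn(1, 77, 64)  # embeddings of the original prompt
ctx_b = torch.randn(1, 77, 64)  # embeddings of the edited prompt

_ = layer(x, ctx_a)             # pass 1: record
layer.mode = "inject"
out = layer(x, ctx_b)           # pass 2: original layout, new content
```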

DreamFields 3D by shengyu-meng

A toolkit to generate 3D mesh models / videos / NeRF instances / multiview images of colourful 3D objects from text and image prompt input

DreamBooth

DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. The train_dreambooth.py script shows how to implement the training procedure and adapt it for Stable Diffusion.

All of these seem extremely useful, especially Dreambooth training.

Thanks for considering these!

Goldenkoron commented 1 year ago

I also very much want DreamBooth to become an integrated feature in the UI. Hopefully it can be made to work. I have an RTX 3090 so it should be possible to run.

LLjo commented 1 year ago

This would be awesome

xbox002000 commented 1 year ago

super cool

OWKenobi commented 1 year ago

+1 for DreamBooth!

StrangeCalibur commented 1 year ago

echo the +1 for dreambooth!

AmericanPresidentJimmyCarter commented 1 year ago

I was able to get it working with about 13 GB of RAM in half precision, but no matter what, my checkpoints appear to be corrupt. :( Not sure why torch.save is producing things I cannot deserialize.
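For anyone hitting something similar, a minimal round-trip sanity check using only the standard torch API (the model and filename here are stand-ins) can tell you whether the file is corrupt at write time rather than damaged later:

```python
import torch

model = torch.nn.Linear(4, 4).half()     # stand-in for the fine-tuned UNet
ckpt_path = "dreambooth_step_1000.ckpt"  # hypothetical filename

# Save only the state dict; plain tensors serialize more reliably than
# whole model objects.
torch.save(model.state_dict(), ckpt_path)

# Round-trip immediately: if this load raises, the file was already bad
# at write time (e.g. an interrupted or out-of-space write).
state = torch.load(ckpt_path, map_location="cpu")
print(f"restored {len(state)} tensors")
```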

OWKenobi commented 1 year ago

Awesome! I have a graphics card that is capable of testing this, so if there is some kind of review process I could help with, please let me know.

I think it should be a separate tab called Dreambooth, where you can select pictures from the hard drive, select a destination for the final model, and then have a button to process the images. Maybe a text field explaining the whole process, telling you the minimum requirements and the ETA. I think the UI would be fairly simple that way.

Also, we need an explanation of what the placeholder will be called. As far as I have read, prompts look like "a photo of a red [X]".

d8ahazard commented 1 year ago

> Awesome! I have a graphics card that is capable of testing this, so if there is some kind of review process I could help with, please let me know.
>
> I think it should be a separate tab called Dreambooth, where you can select pictures from the hard drive, select a destination for the final model, and then have a button to process the images. Maybe a text field explaining the whole process, telling you the minimum requirements and the ETA. I think the UI would be fairly simple that way.
>
> Also, we need an explanation of what the placeholder will be called. As far as I have read, prompts look like "a photo of a red [X]".

So, a few things I've learned thus far, although my results are still TBD.

One: You can totally run this on a GPU with only 8 GB of VRAM. At least, I can run it using the "optimized" repo, and then using the "accelerate config" option to tell it to run only on the CPU. Is it slow as hell? Yes. Can I build a Dreambooth model on my rig? Yes. :P

So, definitely need to figure out if we can somehow leverage "accelerate" for poor schlubs like me who haven't bought a new GPU in a few years.
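For reference, wiring a training loop through accelerate is roughly this (a minimal sketch of the standard Accelerator API; the model, optimizer, and data here are placeholders, and the actual device/precision choice comes from `accelerate config`, not the code):

```python
import torch
from accelerate import Accelerator

# Placeholders; in the real script these are the UNet, its optimizer,
# and the instance-image dataloader.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
dataloader = torch.utils.data.DataLoader(torch.randn(32, 8), batch_size=4)

accelerator = Accelerator()  # device/precision come from `accelerate config`
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    loss = model(batch).pow(2).mean()  # dummy loss for the sketch
    accelerator.backward(loss)         # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```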

For the placeholder/prompt, it's slightly different from Textual Inversion. With TI, we just use a prompt: "MrFluffy" for a dog or something. But with Dreambooth, it appears you also need a "class" parameter, so for MrFluffy it would be "MrFluffy dog", and then the class prompt would be "dog". At least, I think that's the idea. Then to use it, you could do "a photo of MrFluffy dog in the mountains", or "a painting of MrFluffy dog", etc.
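In the diffusers training script, this pair maps to two separate arguments; a hedged illustration (argument names from the diffusers DreamBooth example as I understand it, with values made up to match the example above):

```python
# Illustrative values only; "MrFluffy" is the made-up identifier from above.
dreambooth_args = {
    "instance_prompt": "a photo of MrFluffy dog",  # unique identifier + class word
    "class_prompt": "a photo of a dog",            # class only, drives the
                                                   # prior-preservation images
}
# After training, prompts reuse the identifier, e.g.
# "a photo of MrFluffy dog in the mountains".
```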

There are also a few unknowns I'm trying to experiment with...

How many pictures is ideal? I read somewhere today that a suggestion was 5 close-ups, 5 "half item" shots, and then 5-10 "full item" shots. Maybe more? Maybe less? Same goes for TI, actually.

There's also a "class image" parameter or something like that that conflicting sources say you can either fill via Text2Image, or by using a bunch of existing images of your class item, like dogs. Somewhere said ~100 or more is good for this...but IDK.

Another unknown is how creating the "new" checkpoint works with the script to convert from DreamBooth back to SD. It relies on the existing SD ckpt file to build on, but the DreamBooth training currently requires the diffusers files from the official SD checkpoint. So, is there a way to extract the diffusers files from a custom SD checkpoint and then use those? Or even just extract them from the official checkpoint, as otherwise my method requires cloning the official files from Huggingface, and that's never any fun. :P
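For what it's worth, an SD .ckpt is a pickled state dict, and the sub-models that the diffusers layout splits into separate folders show up as key prefixes, so extraction should be mechanically possible. A quick inspection sketch (SD v1-style prefixes; the path is a placeholder):

```python
import torch

ckpt = torch.load("model.ckpt", map_location="cpu")  # placeholder path
state_dict = ckpt.get("state_dict", ckpt)

# SD v1 checkpoints bundle three sub-models under these key prefixes:
#   model.diffusion_model.*  -> UNet
#   first_stage_model.*      -> VAE
#   cond_stage_model.*       -> text encoder
for prefix in ("model.diffusion_model.", "first_stage_model.", "cond_stage_model."):
    n = sum(k.startswith(prefix) for k in state_dict)
    print(f"{prefix:<25} {n} tensors")
```

The diffusers repo also ships a convert_original_stable_diffusion_to_diffusers.py script that performs this split, which might be adaptable for custom checkpoints.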

Last - Textual Inversion saves an image every N steps, and a checkpoint every N steps. The version of DreamBooth I ported does not do this, but it would be super if it did, so I guess I need to research how that bit works.
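The pattern itself is simple; a self-contained sketch of the hook (the model, data, and interval are stand-ins, not the ported script's actual code):

```python
import torch

# Stand-ins for the real UNet, optimizer, and dataloader.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
data = [torch.randn(4, 8) for _ in range(2000)]

SAVE_EVERY = 500  # hypothetical interval, mirroring TI's behaviour

for step, batch in enumerate(data, start=1):
    loss = model(batch).pow(2).mean()  # placeholder training step
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % SAVE_EVERY == 0:
        # The same check can also trigger a preview txt2img sample.
        torch.save(model.state_dict(), f"dreambooth_{step:06d}.pt")
```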

But, at the very least, I've whipped up a python class that exposes the parameters needed to run this from within our little app. We just need the godz to decide how it should be implemented.
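For a sense of what such a class might expose, here is a guess at its shape; every field name and default below is illustrative, not d8ahazard's actual code:

```python
from dataclasses import dataclass

@dataclass
class DreamboothConfig:
    """Illustrative parameter surface for a Dreambooth tab in the UI."""
    pretrained_model_dir: str            # diffusers-format base weights
    instance_data_dir: str               # the user's subject photos
    instance_prompt: str                 # e.g. "a photo of MrFluffy dog"
    class_prompt: str = "a photo of a dog"
    class_data_dir: str = "class_images"
    output_dir: str = "dreambooth_out"
    resolution: int = 512
    train_batch_size: int = 1
    learning_rate: float = 5e-6
    max_train_steps: int = 1000
    save_every: int = 500                # checkpoint/preview interval
```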

dezigns333 commented 1 year ago

I would love to see DreamFields, DreamFields with Stable Diffusion, or DreamFusion one day.

https://dreamfusion3d.github.io/

dezigns333 commented 1 year ago

A proper AMD GPU version of stable-diffusion-webui would also be very popular.

0xdevalias commented 1 year ago

Dreambooth is available as an extension now, see below:

> Closing this, as I've now started a repo with a standalone extension based on ShivamShrirao's repo here:
>
> https://github.com/d8ahazard/sd_dreambooth_extension
>
> Please feel free to test and yell at me there. I've added a requirements installer, multiple-concept training via JSON, and moved some bits about.
>
> UI still needs fixing, some stuff is broken there, but it should be able to train a model for now.

Originally posted by @d8ahazard in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3995#issuecomment-1304922541

PGadoury commented 1 year ago

Seems like #2725 and #1825, which were closed as duplicates, are focused solely on cross-attention control, with specific suggestions toward implementation, whereas this issue somewhat conflates three requests and muddles the discussion. I understand not wanting to flood the board with duplicate issues, but since Dreambooth is implemented, wouldn't it make more sense to keep the two (three) enhancement tickets separate?

I.e., close this issue (since the discussion seems to be focused on Dreambooth, for which an extension already exists: https://github.com/d8ahazard/sd_dreambooth_extension), open a separate issue for Dreamfields (unless one already exists), and reopen #2275?

Of course, if the intention is to collect any and all enhancement tickets together so that possible developers can find them on a single ticket, then disregard this comment.