XavierXiao / Dreambooth-Stable-Diffusion

Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion
MIT License
7.6k stars · 795 forks

This is gamechanging. Wow. #4

Open nikopueringer opened 2 years ago

nikopueringer commented 2 years ago

After watching experiments on Textual Inversion from the sidelines, I decided to jump in and try this. Thank goodness I had an A6000 on hand for the whopping 38 GB of VRAM needed.

But the results are amazing. Far better than basic Textual Inversion, and it trains in a FRACTION of the time!

The following images are of me as a zombie, using a basic prompt like “digital painting of yourNameHere as a zombie, undead, decomposing, concept art, by Greg Rutkowski and Wlop”.


Some of my initial findings:

Low CFG (I used 5) helps the image be more flexible in style and subject. Otherwise it kind of just looks like the iPhone selfies I trained it on.

You really have to push to bring in painter styles and subjects. I had to write “zombie, undead, decomposing” before it finally broke away from generating a normal face.

There are about a million variables I can think of to start tweaking. I would love to hear about other people’s experiences. What’s the best image set to train on? How many images? How many iterations? Can we make this run on a 3090?

I have a ton of questions and experiments to do. What do you guys think?
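For anyone who wants to reproduce the low-CFG finding once they have a fine-tuned checkpoint, here is a minimal inference sketch. It assumes the weights have already been converted to the Hugging Face diffusers format and uses a placeholder model path plus the "sks" identifier; adjust both to your own training.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder path: assumes the DreamBooth checkpoint was converted to diffusers format.
pipe = StableDiffusionPipeline.from_pretrained(
    "./my-dreambooth-model", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "digital painting of sks person as a zombie, undead, decomposing, "
    "concept art, by Greg Rutkowski and Wlop",
    guidance_scale=5.0,       # low CFG keeps the style flexible instead of echoing the selfies
    num_inference_steps=50,
).images[0]
image.save("zombie.png")
```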


JoePenna commented 2 years ago

Some similar results here. Thank you for the help setting it up, Niko!

Training set: (images)

Best I got with traditional textual inversion: (images)

With fine-tuning: (images)

And some prompts (from an older model that wasn't as faithful): (images)

Some more info here (Stable discord).

ExponentialML commented 2 years ago

Nice results! I've been running some experiments as well. I've upped the class images from the suggested 8 to about 100. It seems to generalize better.

I almost got it to work on a 3090 by trying to train the model using DeepSpeed ZeRO-3 Offload, but it seems like there would be a bit more grunt work (casting tensors, conversions, model splitting, etc.) than just changing the training method.
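For reference, a rough sketch of what wiring up ZeRO-3 offload looks like at the PyTorch Lightning level. The string strategy shortcut assumed here comes from more recent Lightning releases than the version this repo pins (which exposes the same feature under different arguments), and as noted, the model code itself still needs the extra work mentioned above.

```python
import pytorch_lightning as pl

# Sketch only: assumes a recent pytorch-lightning where the string strategy shortcut exists.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,                          # mixed precision to further cut VRAM
    strategy="deepspeed_stage_3_offload",  # ZeRO stage 3 with optimizer/param offload to CPU
    max_steps=800,
)
# trainer.fit(model, data)                 # model/data come from the repo's main.py setup
```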

XavierXiao commented 2 years ago

Wow. I am not familiar with how to write good prompts for SD (or any text-to-image model), so I can't generate the cool things you did myself. I trained with some images of Newfoundland dogs (my profile pic) and used exactly the same zombie prompt, and I also got some interesting results. The prompt is so important!

(Generated image: “digital painting of sks black dog as a zombie, undead, decomposing, concept art, by Greg Rutkowski and Wlop”)

nikopueringer commented 2 years ago

Ha, those dog pictures are amazing! That’s so awesome.

Truly, the Dreambooth method implemented here is next level. It’s the beginning of a huge technological shift, and a solution to one of the key problems preventing image generators from being a useful tool. I hope development continues!

robertsehlke commented 2 years ago

Second this, I've been getting some great results as well. Customization and workflow integration will be the killer features of open source image generation models.

I wonder if there is an efficient way to save and use the newly learned concepts, similar to the embedding files produced by textual inversion (which I realize wouldn't be directly applicable here).

khronimo commented 2 years ago

Nice results! I've been running some experiments as well. I've upped the class images from the suggested 8 to about 100. It seems to generalize better.

I almost got it to work on a 3090 by trying to train the model using DeepSpeed ZeRO-3 Offload, but it seems like there would be a bit more grunt work (casting tensors, conversions, model splitting, etc.) than just changing the training method.

It would be absolutely amazing to see this running on a 3090.

ExponentialML commented 2 years ago

Also just to confirm in case someone is wondering, I haven't had any luck fine-tuning multiple concepts on the newly trained models. So if you train a concept, then train another concept on that newly trained model, it will combine them, even with a different identifier.

I haven't checked whether the identifier is linked to the seed or not, but I will when I get the chance. Hopefully it's possible, because my disk only has so much space 🙂.

robertsehlke commented 2 years ago

Also just to confirm in case someone is wondering, I haven't had any luck fine-tuning multiple concepts on the newly trained models. So if you train a concept, then train another concept on that newly trained model, it will combine them, even with a different identifier.

That's curious - I naively trained a different concept (also different class id) on top of the first model I fine-tuned and that worked out pretty well. The model can generate images from both concepts individually, though it tends to blend them if used together in a prompt.

Edit: I did notice one thing that may be related: running the prompt that generates the regularization images with the first- and second-generation fine-tuned models visibly degrades/collapses the results. Will make a separate issue.

XavierXiao commented 2 years ago

Agree with ExponentialML. We should generate more reg images; 8 seems to be too few. I have updated the README.
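For anyone generating the class (regularization) images outside the repo, here is a hedged sketch using diffusers rather than the repo's own txt2img script; treat it as one possible way to do it, and swap in your own class prompt and counts.

```python
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

os.makedirs("regularization_images", exist_ok=True)
num_class_images = 100                      # far more than the original 8
for i in range(num_class_images):
    # "a photo of a dog" is just an example class prompt; use your own class word.
    image = pipe("a photo of a dog", num_inference_steps=50).images[0]
    image.save(f"regularization_images/dog_{i:03d}.png")
```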

1blackbar commented 2 years ago

Well, I had a different approach. As you know, Sly Stallone looks like crap in SD, so I heavily overfit his face, and I also trained a regular, default-settings embedding with so-so likeness. I use them both: the lower-likeness one to change the style, and the overfit, high-likeness one to bring back his likeness. So this is regular textual inversion; you can make it work if you really want to. My take is that you can change the style even with heavy overfitting, but it's harder, and you can't go too far into caricatured styles; it works more for classic painters or comic-book styles that don't distort faces too much. It takes a hell of a lot of GPU to train the repo we're on (I'm sure the code works much better as well), but regular textual inversion works pretty well if you're willing to go the extra steps.
Some outputs from SD with textual inversion (not Dreambooth). The advantage of this is file size!!! That is a huge deal for me, at least, versus the Dreambooth approach. (images)


1blackbar commented 2 years ago

OK, you might say, "but he's already in the SD ckpt file, right?" Yeah, he is. So here's my mom when she was 20 years old, regular textual inversion as well. Overall, what would impress me most would be code that let us fine-tune Robocop with his suit and resynthesize the suit perfectly; I don't think the SD architecture is capable of that, though. There is an Iron Man suit trained nicely into SD, but all of the outputs are mutations of the suit. It's not so jarring, because the movies had a lot of revisions. Can't wait for an episode on SD! (images)

hopibel commented 2 years ago

Gamechanging or not, it's unlikely to catch on if hobbyists can't play around with it, so the fact that this technique requires a workstation-class GPU is a huge downside :P

burgalon commented 2 years ago

@nikopueringer what did you use for regularization on your own photo training?

Desm0nt commented 2 years ago

Gamechanging or not, it's unlikely to catch on if hobbyists can't play around with it, so the fact that this technique requires a workstation-class GPU is a huge downside :P

It requires ~1 hour of a Tesla A40/A6000 on Vast.ai (for training and downloading the tuned model), a $0.4-0.8 investment. It's really cheap. And then you can run this model in any Colab or any local repo with 6+ GB of VRAM, just as you usually do with the original model.

hopibel commented 2 years ago

@Desm0nt The saying that comes to mind is "Beware trivial inconveniences". By which I mean adoption of two competing things is heavily influenced by which one is more accessible.

Being able to set up the repo on a cloud compute service is a nontrivial amount of additional time and know-how needed compared to clicking "run all cells" in a colab notebook. Not to mention (correct me if I'm wrong) this technique involves modifying the weights of the model itself, while textual inversion produces embeddings that can be shared independently of the model, which makes it possible for the community to build repositories of object and style "plugins" like we see with huggingface's sd-concepts. I'd argue the collaborative nature of sd-concepts has the potential to be an even bigger gamechanger.
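To make the "plugin" point concrete: a shared concept is just one extra row in the text encoder's embedding table, so loading a downloaded embedding takes only a few lines. This is a hedged sketch assuming the learned_embeds.bin format used by sd-concepts-style repositories (a dict mapping the placeholder token to its vector); the file path is a placeholder.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder path: learned_embeds.bin maps e.g. "<my-concept>" to a 768-d vector.
learned = torch.load("my-concept/learned_embeds.bin", map_location="cpu")
token, embedding = next(iter(learned.items()))

# Register the new token and copy its learned vector into the embedding table.
tokenizer.add_tokens(token)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(token)
with torch.no_grad():
    text_encoder.get_input_embeddings().weight[token_id] = embedding
```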

Desm0nt commented 2 years ago

@hopibel A Jupyter notebook on Vast.ai works the same way as in Colab =)

Textual inversion is good and really convenient compared to this solution, but it can't actually add knowledge of an object to the model. It just adds a non-textual description that can produce something similar to the required object using the model's existing knowledge. If the model doesn't have enough knowledge to produce it accurately, the result will be slightly (or very much) different from the required object. And with 3-5 samples it creates a description that tries to produce exactly the same object as in the samples.

But with Dreambooth you actually fine-tune the model. With 100+ samples the model really learns information about the object in different conditions (not just a description of something similar), which allows it to recreate exactly the object we are looking for (if we want, even as photorealistically close to the images in the training set as possible). This makes it possible to fully stylize the object and place it in different conditions without fear of losing its features and likeness.

The difference between Dreambooth and textual inversion is like the difference between real knowledge of an artist's style in the model (which lets you apply it to any query) and hand-picked combinations of descriptions that give an apparently similar style under certain conditions but lose the similarity under others.

It's a real, true fine-tune of the model, but it doesn't require huge amounts of data and it takes a lot less time to learn new concepts, without the risk of screwing up the model's generalization.

The only significant drawback: the resulting model weighs 12.4 GB instead of the original model's 4 GB, and my knowledge isn't enough to compress it somehow before downloading it from the cloud machine.
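For what it's worth, most of that extra weight is optimizer state stored in the Lightning checkpoint, so a minimal pruning sketch like the one below (placeholder paths; keep a backup of the original) should bring the file back to roughly the size of the base checkpoint, or about half that with the fp16 cast.

```python
import torch

ckpt = torch.load("logs/my-run/checkpoints/last.ckpt", map_location="cpu")

# Keep only the model weights; cast float32 tensors to fp16 to roughly halve the size.
state_dict = {
    k: (v.half() if v.dtype == torch.float32 else v)
    for k, v in ckpt["state_dict"].items()
}
torch.save({"state_dict": state_dict}, "last-pruned.ckpt")
```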

ThereforeGames commented 2 years ago

@Desm0nt Interesting, I was wondering how Dreambooth would fare with general style transfer. The Google page advertises it as "subject-driven generation" but, based on your comment, it should outperform Textual Inversion even at abstract tasks.

Right now I'm particularly interested in seeing if it's possible to create good pixel art in SD via img2img. It's already somewhat possible with complex prompt engineering, but the results are inconsistent.

I've tried finetuning a collection of 14 pixelized character portraits in textual inversion, but even after 50k iterations, the style transfer is a complete mess.

Any successful examples of something like this in Dreambooth? Doesn't have to be pixel art.

TingTingin commented 2 years ago

I naively trained a different concept (also different class id) on top of the first model I fine-tuned

What was the size of the model at the end of both training runs?

nikopueringer commented 2 years ago

@nikopueringer what did you use for regularization on your own photo training?

Just 10 pictures or so generated from “man” as my prompt

1blackbar commented 2 years ago

For more examples of inversion you can read here, guys; I don't want to hijack this repo, since it's something else, but it's pretty much doing a similar thing: https://github.com/rinongal/textual_inversion/issues/35 (images)


TingTingin commented 2 years ago

I believe inversion is better at styles and this is better at subjects

dboshardy commented 2 years ago

I believe inversion is better at styles and this is better at subjects

Are there any side-by-sides to assess that yet? I do agree, textual inversion does achieve some pretty incredible style transfer. But as @1blackbar demonstrated, it can do subjects pretty well too.

1blackbar commented 2 years ago

Regular inversion needs the embedding to be put late in the prompt if you fine-tuned with overfitting, and I did; that Sly is 60 vectors. So the prompt was "dawn of the dead comics undead decomposed bloody zombie, painting by greg rutkowski, by wlop by artstation zombie portrait of zombie slyf as a zombie". The embedding is the word "slyf". I use the AUTOMATIC1111 repo for all of this and the nicolai25 repo for inversion. I keep two embeddings, like I wrote: one has average likeness but great stylization, the other is the one you see here, with great likeness and harder-to-obtain (but definitely possible) stylization. As for inversion being better at something or not: no proof, doesn't matter. Anyway, I don't have a GPU to try out this fork, but an entire 4 GB file for getting one likeness is a stretch for me, so I'd rather wait for a faster solution like embeddings that are about 20 KB in size. I'm building a library of fixed/repaired subjects.

Maki9009 commented 2 years ago

I don't know, mine fails with OSError: cannot open resource... It always fails at 59% after it already did a first 100%, so around 511 samples, and it crashes after that. I hope it works though; currently downloading the model. Maybe it worked, maybe not.

nikopueringer commented 2 years ago

I don't know, mine fails with OSError: cannot open resource... It always fails at 59% after it already did a first 100%, so around 511 samples, and it crashes after that. I hope it works though; currently downloading the model. Maybe it worked, maybe not.

If it's crashing at 500 samples, you're probably encountering the missing font file. Just replace it, and edit the .py file that calls for it to use whatever font you replaced it with.

Maki9009 commented 2 years ago

I don't know, mine fails with OSError: cannot open resource... It always fails at 59% after it already did a first 100%, so around 511 samples, and it crashes after that. I hope it works though; currently downloading the model. Maybe it worked, maybe not.

If it's crashing at 500 samples, you're probably encountering the missing font file. Just replace it, and edit the .py file that calls for it to use whatever font you replaced it with.

Yup, thank you, I figured that out eventually. Just one last question though: once I've trained the model, do I still have to run it on a GPU with 40 GB of VRAM, or can I run it locally now if I want? Or on free Colab? I'd like to test it out with Deforum Diffusion.

nikopueringer commented 2 years ago

It will run with the same VRAM requirements as the regular model!

nicolai256 commented 2 years ago

Nice results! I've been running some experiments as well. I've upped the class images from the suggested 8 to about 100. It seems to generalize better.

I almost got it to work on a 3090 by trying to train the model using DeepSpeed ZeRO-3 Offload, but it seems like there would be a bit more grunt work (casting tensors, conversions, model splitting, etc.) than just changing the training method.

Maybe using this method on top of yours might be good for distributing VRAM without errors. With this method I can generate 2304x2304 px images on my 3090. I'm not sure if just pasting those files would help Dreambooth, but maybe it could? It will probably be another file you'll have to apply this method to, but it seems like it could work: https://drive.google.com/drive/folders/1lqcWpHBHV_UAlaPtdfaSVwisdb2uGT8x?usp=sharing. If you could make a repo of your efforts, I could try giving it a shot :)

Oscerlot commented 2 years ago

I managed to do a training sesh on Vast.ai for ~$1, and since this looks like the hip thread where all the cool kids are coming, I figure this is the best spot to post a few quick steps for anyone who comes across this and is clueless about getting it going, like me (not a full tutorial, more like "look here to start, and here are some issues and how to fix them"):

I might have gotten a couple of things wrong; this was mostly from memory and notes from my trial-and-error approach, so if it doesn't work, I probably screwed something up in writing this, but hopefully it puts you on the right track if you get stuck. Or, you know, don't find or do this at all, so the pricing of the instances doesn't go up and I can keep training for cheap 👀

prettydeep commented 2 years ago

Do the training and regularization images need to be 512x512, like the original SD model is based on?

prettydeep commented 2 years ago

Just a follow-up to the guide provided by @Oscerlot ...

If using Vast.ai, make sure to get an instance with PyTorch, an A6000 GPU, and 100 GB of drive space (in case you want to generate multiple models).

1) I used ~100 training images and ~300 regularization images. All were 512x512.

2) I modified the /Dreambooth-Stable-Diffusion/configs/stable-diffusion/v1-finetune_unfrozen.yaml file to allow for 4000 steps (line 120). The main.py saves a last.ckpt every 500 steps, so for each save I just moved last.ckpt to <#steps>.ckpt so I could try multiple models.

3) I added "--no-test" to the main.py training command to prevent this issue.

4) Before creating the conda env, I did:
   a) apt -y install gcc (this was missing on my first attempt and caused a failed pip install)
   b) conda update -n base conda (just to make sure conda is up to date)

5) I ran the following within the Dreambooth directory to prevent the font error referenced above by Niko:

mkdir data/
wget https://github.com/prawnpdf/prawn/raw/master/data/fonts/DejaVuSans.ttf -P data/

6) Training ran well, but I could not directly download the ckpt file (~12 GB) for some reason, so I moved it to the root directory and downloaded it via scp. To set up SSH for scp, just follow the directions provided by Vast.ai when you click the terminal button on your specific instance. EDIT 9/20: Before downloading, you can prune your ckpt down to 2-3 GB by using this script.

On your instance...

mv /workspace/Dreambooth-Stable-Diffusion/logs/<training folder>/checkpoints/last.ckpt /workspace/Dreambooth-Stable-Diffusion/last.ckpt

then on your local machine...

scp -P <port#> root@ssh6.vast.ai:/workspace/Dreambooth-Stable-Diffusion/last.ckpt <local folder>

Below are some sample output images.

Thoughts:
1) The results depend heavily on using a prompt with "a photo of sks man" rather than "sks man" alone.
2) Separating "a photo of sks man" from the rest of your prompt with a comma tends to produce better results.
3) If your training images are only headshot selfies, which mine were, then wide shots have poor likeness quality.
4) I tested multiple models, and the difference beyond 1000 steps is negligible.
5) The lower the guidance value, the better the image style but the lower the likeness. I found a nice balance at ~7, although I could never get a good zombie result using Niko's prompt regardless of the guidance value. Maybe Niko has a natural zombieness about him :)


Luvata commented 2 years ago

Can some of you share a trained Dreambooth checkpoint with me? I'm going to run some tests on the fine-tuned weights, but my current hardware isn't capable of running the current training.

martin-888 commented 2 years ago

I managed to do a training sesh on vast.ai for ~$1, but since this looks like the hip-thread where all the cool kids are coming, I figure this might be the best spot to just post a quick few steps here for anyone who does come across this to have a bit of direction on how to go about it if you're clueless to get it going, like me (Not a full tutorial, but more like a look here to start and here are some issues and how to fix!):

  • prep your training and regularization data in advance
  • pick an instance with at least 1 A6000 (cheapest that meets the VRAM reqs, I've found - and 1 is good to start with since you might be spending more time figuring out how to set it up than actually training it). Make sure the download (and upload) speeds are decent, like >100mbps ...

I'm following your exact steps, but 1 epoch on an A6000 takes ~75 s. Is there a param to tweak, or what could be wrong?

EDIT: 1 epoch = 100 iterations at ~1 s per iteration, so training is done in less than 15 min!

burgalon commented 2 years ago

We're building a service that allows easy fine-tuning and generation of images from the newly fine-tuned model. Please join our Discord to discuss fine-tuning further, and DM me if you'd like an invite: https://discord.gg/mp6QRuNN

Randy-H0 commented 2 years ago

Hi there! I would love to know how to run this on Google Colab. I only have the Pro tier, which occasionally lets me grab a P100; those aren't that high-end, with about 16 GB of VRAM.

shahruk10 commented 2 years ago

Got decent results with mixed-precision training on an NVIDIA RTX 3090 (the changes made can be seen here).

It would be interesting to see some experiments with partially unfrozen models such that the fine-tuning can be done on GPUs with < 24 GB VRAM... It feels like unfreezing only the input blocks of the U-Net architecture (the left side of the "U") could work for adding new subjects, since that part is primarily responsible for encoding the inputs.
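A hedged sketch of that partial-unfreezing idea, assuming `model` is the LatentDiffusion object this repo builds (in the CompVis codebase the U-Net lives at model.model.diffusion_model, and its downsampling half is the input_blocks module):

```python
# Freeze the whole U-Net, then re-enable gradients only for the input (downsampling)
# blocks and the timestep embedding; fewer trainable parameters means far less
# optimizer memory during fine-tuning. Attribute names assume the CompVis ldm codebase.
unet = model.model.diffusion_model

for p in unet.parameters():
    p.requires_grad = False
for name, p in unet.named_parameters():
    if name.startswith("input_blocks") or name.startswith("time_embed"):
        p.requires_grad = True

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"training {trainable / 1e6:.1f}M of {total / 1e6:.1f}M U-Net parameters")
```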

Sample outputs: "shahrukhossain person as a portrait by Yoji Shinkawa in the style of Metal Gear Solid", "shahrukhossain person as an anime character", "shahrukhossain person as a portrait by Greg Rutkowski", "shahrukhossain person selfie on top of Mount Everest".

ShivamShrirao commented 2 years ago

Happy to share that it now runs in just 18 GB of VRAM and is even 2x faster. Hoping to get below 16 GB in a day or two.

More details here: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion/issues/35
