TheLastBen / fast-stable-diffusion

fast-stable-diffusion + DreamBooth
MIT License
7.52k stars 1.31k forks

Getting terrible results #1127

Open ariandemnika opened 1 year ago

ariandemnika commented 1 year ago

Hi, I'm using the training code below and I'm not even getting the person I've trained on; it's a completely different person. First I train the text encoder for 350 steps, then the UNet for 1500 steps.

    #train text encoder
    python diffusers/examples/dreambooth/train_dreambooth.py \
    --train_only_text_encoder \
    --image_captions_filename \
    --train_text_encoder \
    --dump_only_text_encoder \
    --pretrained_model_name_or_path="/home/ec2-user/sd/stable-diffusion-v1-5" \
    --instance_data_dir="/home/ec2-user/sd/nmoQItVi_cropped" \
    --output_dir="/home/ec2-user/sd/nmoQItVi_output" \
    --instance_prompt="a photo of nmoQItVi person" \
    --resolution=512 \
    --mixed_precision="fp16" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 --gradient_checkpointing \
    --use_8bit_adam \
    --learning_rate=2e-6 \
    --lr_scheduler="polynomial" \
    --lr_warmup_steps=0 \
    --max_train_steps=350
    # train the UNet
    python diffusers/examples/dreambooth/train_dreambooth.py \
    --image_captions_filename \
    --train_only_unet \
    --save_starting_step=100000 \
    --save_n_steps=500 \
    --Session_dir="/home/ec2-user/sd/nmoQItVi_output" \
    --pretrained_model_name_or_path="/home/ec2-user/sd/stable-diffusion-v1-5" \
    --instance_data_dir="/home/ec2-user/sd/nmoQItVi_cropped" \
    --output_dir="/home/ec2-user/sd/nmoQItVi_output" \
    --instance_prompt="a photo of nmoQItVi person" \
    --resolution=512 \
    --mixed_precision="fp16" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 --gradient_checkpointing \
    --use_8bit_adam \
    --learning_rate=2e-6 \
    --lr_scheduler="polynomial" \
    --lr_warmup_steps=0 \
    --max_train_steps=1500

I've copied the code into a .ipynb notebook and I'm using an AWS EC2 instance to train on.

TheLastBen commented 1 year ago

Set the UNet steps to 3000, reduce the number of instance images, make sure you rename them correctly, and set the UNet learning rate to 3e-6.
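
Applied to the UNet command posted above, that advice maps to changing just these two flags and leaving the rest as they are (a sketch of the change, not a full command):

    --learning_rate=3e-6 \
    --max_train_steps=3000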

ariandemnika commented 1 year ago

@TheLastBen I've tried 3000 steps and it worked, but I got the same results with 1800 steps at lr=3e-6. I'm using only 10 instance images. The quality was better before; this is not the best it can be. What more should I do?

This was generated on Google Colab before this last update: (image: before)

This is now on an AWS EC2 T4 GPU, with CodeFormer applied: (image: after)

TheLastBen commented 1 year ago

Try retraining the model with the new default settings: lr 2e-5, 200 text encoder steps, 650 UNet steps.

ariandemnika commented 1 year ago

@TheLastBen every generated image now looks almost exactly like the original training images. (image: 5129991595)

TheLastBen commented 1 year ago

How many total steps? Did you go over 1000?

And don't resume training; start over with 10 instance images, 350 text encoder steps at 1e-6, and 600-800 UNet steps at 2e-5.
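
In terms of the flags in the commands posted above, that corresponds roughly to:

    # text encoder pass
    --learning_rate=1e-6 \
    --max_train_steps=350

    # UNet pass
    --learning_rate=2e-5 \
    --max_train_steps=650    # pick a value in the 600-800 range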

ariandemnika commented 1 year ago

@TheLastBen 200 steps on text encoder, 650 steps on unet, total 850 steps

TheLastBen commented 1 year ago

redo the training and set the unet steps this time to 400, keep the other settings like before, and test it

ariandemnika commented 1 year ago

@TheLastBen The face doesn't look like the trained images; it's a completely different person.

UNet:

steps: 400, lr: 2e-5

Text encoder:

steps: 350, lr: 1e-6

TheLastBen commented 1 year ago

Then resume the UNet training for 100 more steps, and so on...
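
One way to do that with the script as posted (a sketch, assuming the previous output directory contains a full diffusers checkpoint; the Colab session workflow may handle resuming differently) is to rerun the UNet command with the pretrained path pointed at the last output and a small step count:

    # hypothetical resume sketch: continue from the previous output for 100 more steps
    # (remaining flags as in the original UNet command above)
    python diffusers/examples/dreambooth/train_dreambooth.py \
    --image_captions_filename \
    --train_only_unet \
    --pretrained_model_name_or_path="/home/ec2-user/sd/nmoQItVi_output" \
    --instance_data_dir="/home/ec2-user/sd/nmoQItVi_cropped" \
    --output_dir="/home/ec2-user/sd/nmoQItVi_output" \
    --instance_prompt="a photo of nmoQItVi person" \
    --resolution=512 \
    --mixed_precision="fp16" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 --gradient_checkpointing \
    --use_8bit_adam \
    --learning_rate=2e-5 \
    --lr_scheduler="polynomial" \
    --lr_warmup_steps=0 \
    --max_train_steps=100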

StiffPvtParts commented 1 year ago

I'm having similar issues. I've been unable to match previous results using the latest version of Dreambooth.

ariandemnika commented 1 year ago

Me too, I can't match the results! @TheLastBen how can I clone the October version of the repo?

TheLastBen commented 1 year ago

Because the learning rate was increased, the UNet steps should now not go over 1000 for 15 or fewer instance images; keep the steps low and slowly add 100 at a time.

StiffPvtParts commented 1 year ago

Because the learning rate was increased, the UNet steps should now not go over 1000 for 15 or fewer instance images; keep the steps low and slowly add 100 at a time.

Thank you for your answer.

Is there a place where I can read more about these changes and/or the process of training models?

TheLastBen commented 1 year ago

You can search this repo's discussions and issues; there are a lot of topics regarding training.

Quark999 commented 1 year ago

I also struggle to achieve the results I had before; the best were when there was a percentage for the encoder and when I still had to specify woman/man. I haven't seen an explanation as to why that is no longer needed or what has changed. Is there a way to emulate the old behaviour? Should I be using tags instead? I also didn't find much of a description of them. And lastly, what impact does the "increased learning rate" have, and where would I manually set it, why, and to what?

ariandemnika commented 1 year ago

@TheLastBen do we have to do prompt engineering, or can we just give it simple prompts?

TheLastBen commented 1 year ago

@Quark999 Increasing the learning rate speeds up the training. After doing some tests, I found that 2e-5 is just below the limit, so with that setting you can train in under 15 minutes and get even better results than before.

TheLastBen commented 1 year ago

@ariandemnika You can use simple prompts, but not too simple; add "movie still" to the prompt and "cinematic" to help with the quality.
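
For example, something along these lines (using the instance name from the training commands above; the exact wording is up to you):

    movie still of nmoQItVi, cinematic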

juanamd commented 1 year ago

I also struggle to achieve the results I had before; the best were when there was a percentage for the encoder and when I still had to specify woman/man. I haven't seen an explanation as to why that is no longer needed or what has changed. Is there a way to emulate the old behaviour? Should I be using tags instead? I also didn't find much of a description of them. And lastly, what impact does the "increased learning rate" have, and where would I manually set it, why, and to what?

Same here. It would be super helpful if anyone could provide the learning rates for both the UNet and the text encoder that were used before they were added as settings in the Colab, so we can use them as references to compare with the new values. Also, I was previously using ~50 instance images with good results. With these new settings, should I be using fewer images? Are regularization images recommended when training a single person? As far as I know, they were used before (with the woman/man setting) and I got good results, but I don't know if that had anything to do with the regularization or with the learning rate settings. Thanks!

TheLastBen commented 1 year ago

With the new settings, stick to 10 images per instance, around 600-800 UNet steps per instance, and a total of 400 text_enc steps.

No need for regularization.

ariandemnika commented 1 year ago

@TheLastBen I've trained the text encoder (learning_rate=1e-6, max_train_steps=350, lr_scheduler="polynomial") and the UNet (learning_rate=2e-5, max_train_steps=650, lr_scheduler="polynomial") with 10 images. I've tested on Kylie Jenner and the image below is the best result I've got, and it doesn't look like her. I've also tested up to 1000 UNet steps and that didn't work. I don't know what's happening!

(image: 4810357586)

prompt="Vaporwave portrait of tpUJoQGb person, realistic portrait, pinkish vaporwave colors, vibrant, purple neon colors, gradients, symmetrical highly detailed, digital painting, arstation, concept art, smooth, sharp focus, illustration, cinematic lighting art by Artgerm and Greg Turkowski and Alphonse Mucha",
num_inference_steps=50,
guidance_scale=7,
width=512,
height=512
Quark999 commented 1 year ago

With just 10 images for an instance, how easy is it to capture the face, mid-body, and full body including feet, and from different angles? When collecting images I find I end up with more just to cover the basics. If I had, say, 30 images, I suspect my old rule of thumb of 100 steps per extra image no longer works; what would I do if I did want to train on those extra images?

TheLastBen commented 1 year ago

@ariandemnika don't add "person" to the prompt; you are reducing the weight of the trained subject.

tpcdaz commented 1 year ago

Yep, same here. The new 10-image/650-step settings unfortunately produce laughable results. I wish there were still a copy of the previous 3000-step 2e-6 Colab we could use, as that was perfect and worked every single time.

ariandemnika commented 1 year ago

@ariandemnika don't add "person" to the prompt; you are reducing the weight of the trained subject.

@TheLastBen this just made it worse. Do you mean the instance_prompt argument of the UNet training?

TheLastBen commented 1 year ago

Don't use "person" or any similar word in the inference prompt and in the instance name/prompt

ariandemnika commented 1 year ago

@TheLastBen still not getting anything better.

TheLastBen commented 1 year ago

@TheLastBen still not getting anything better.

I added back the previous settings

JuanIrache commented 1 year ago

For me, the current version needs about 350 UNet steps to produce results similar to what the previous one gave with 1000, in case this helps someone.

ariandemnika commented 1 year ago

@TheLastBen still not getting anything better.

I added back the previous settings

@TheLastBen I've gone back to the old training settings and I'm still not getting the results: text encoder lr 1e-6 with 350 or 900 steps, UNet lr 2e-6 with 1000 or 1800 steps, using 10 images.

TheLastBen commented 1 year ago

So the problem isn't with the settings I changed; it's with your instance images, since everything else was the same as before.

ariandemnika commented 1 year ago

@TheLastBen the training is good now using 250 text encoder steps at lr 1e-6 and 650 UNet steps at lr 1e-5 with 6 images, but I guess my generation code has something wrong, because when I use the trained model on Google Colab I get amazing results!! I'm using this code to generate images with the trained model:

    import random
    import torch
    from diffusers import DDIMScheduler, StableDiffusionPipeline

    scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
    pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16).to("cuda")
    # bypass the safety checker by returning the images unchanged
    pipe.safety_checker = lambda images, clip_input: (images, False)
    pipe.enable_attention_slicing()

    img_name = random.randint(999999999, 9999999999)
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        width=width,
        height=height,
    ).images[0]
    image.save(f"{img_name}.png")
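
The snippet assumes model_id, prompt, negative_prompt and the sampling parameters are already defined; hypothetical placeholder values (not the exact ones used here) would look like:

    model_id = "/home/ec2-user/sd/nmoQItVi_output"  # path to the trained model (placeholder)
    prompt = "movie still of nmoQItVi, cinematic"
    negative_prompt = "blurry, low quality"
    num_inference_steps = 50
    guidance_scale = 7
    width = 512
    height = 512
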
TheLastBen commented 1 year ago

Why aren't you using A1111 to generate? And for 6 images, you can set the text_encoder steps to 50-100 for less overfitting.

ariandemnika commented 1 year ago

Why aren't you using A1111 to generate?

And for 6 images, you can set the text_encoder steps to 50-100 for less overfitting.

@TheLastBen I'm not using A1111 because I've built an API endpoint that generates only from already existing prompts and then deletes the instance.

Okay, I'll try decreasing it, but the results are good at 250 steps.

TheLastBen commented 1 year ago

For a face you can keep text_enc at 250, but for a style, reduce it to allow more flexibility.

qudabear commented 3 months ago

With the new settings, stick to 10 images per instance, around 600-800 UNet steps per instance, and a total of 400 text_enc steps.

No need for regularization.

This is only good for faces, not for full bodies, half bodies, or styles. We need more info.