ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

save_weights doesn't seem to save the finetuned model, plus problems with merging latest from diffusers #171

Open David-Hari opened 1 year ago

David-Hari commented 1 year ago

Describe the bug

I tried running train_dreambooth.py on some of my own images and a prompt, but each sample image looked like output from the initial model (so for the prompt "a photo of xzv" I just got random images instead of the "xzv" subject I was training it to recognise). That could have just been because the learning rate was too low, but when I compared the snapshots they were byte-for-byte identical, indicating that nothing in the network had changed. Could someone explain how save_weights works and, assuming it works fine for everyone else, what I am doing wrong?
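
For context, my understanding is that save_weights should be doing roughly the following (a minimal sketch based on the diffusers StableDiffusionPipeline API, not the script's exact code; save_dir is a placeholder):

pipeline = StableDiffusionPipeline.from_pretrained(
    args.pretrained_model_name_or_path,
    unet=accelerator.unwrap_model(unet),                  # should carry the fine-tuned weights
    text_encoder=accelerator.unwrap_model(text_encoder),
    revision=args.revision,
)
save_dir = os.path.join(args.output_dir, str(global_step))  # placeholder save location
pipeline.save_pretrained(save_dir)

If that is roughly right, then byte-for-byte identical snapshots would mean the unwrapped unet weights themselves never changed.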

I also tried to update the script by merging the more recent changes from the original diffusers train_dreambooth.py. This was because the newest version can do checkpointing, but it does so in a different way to this one. I got it to run, but I'm not sure if it works. When I generate sample images they are all completely black.

Here are my changes. I've still got some TODOs in there, so you can see the bits I was not sure would work, but it should at least run.

Perhaps the blank image problem is due to the loss going to nan at some point during training. I'm not sure what causes that; perhaps the learning rate is too high. When I reduced it to 1e-6 I was able to get at least 100 steps, but eventually the loss still goes to nan. See the attached log of the model_pred and target tensors at the line loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean").
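
For reference, a guard around that line makes it easy to see exactly when this happens (a sketch, not the script's own code; global_step is the training loop's step counter):

loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
# Stop as soon as the loss becomes non-finite so the offending batch can be inspected.
if not torch.isfinite(loss):
    print(f"Non-finite loss at step {global_step}")
    print("model_pred =", model_pred)
    print("target =", target)
    raise RuntimeError("loss became nan/inf")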

Reproduction

Run the script linked in the description, using the following command line arguments:

--pretrained_model_name_or_path "CompVis/stable-diffusion-v1-4"
--revision "fp16"
--output_dir "<output folder on my computer>"
--instance_data_dir "<images folder on my computer>"
--instance_prompt "a photo of xzv"
--save_sample_prompt "a photo of xzv"
--seed 3434554
--resolution 512
--train_batch_size 1
--mixed_precision "fp16"
--use_8bit_adam    <-- Possible on Windows due to https://github.com/DeXtmL/bitsandbytes-win-prebuilt
--learning_rate 5e-6
--gradient_accumulation_steps 1
--gradient_checkpointing
--lr_scheduler "constant"
--lr_warmup_steps 0
--num_class_images 50
--sample_batch_size 4
--max_train_steps 200
--checkpointing_steps 100

I used about 20 sample images.

Logs

Steps:  36%|███▌      | 71/200 [00:29<00:48,  2.66it/s, loss=0.206, lr=5e-6]
model_pred = tensor([[[
          [-0.8237,  0.6436,  1.0195,  ..., -0.4800,  0.9629,  0.1987],
          [ 0.3696,  0.0919, -0.2876,  ..., -1.0273, -0.5376, -0.4285],
          [-0.1575, -0.4956, -0.5435,  ...,  0.6250, -0.3247, -0.3101],
          ...,
          [ 0.6909,  0.7163, -0.2115,  ..., -0.9668, -0.3865,  0.4448],
          [-0.4167,  0.4500,  0.8555,  ...,  0.0676, -0.1455,  1.2188],
          [-0.2559, -0.8154,  0.0177,  ...,  0.4170,  0.3999, -0.5459]]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
target = tensor([[[
          [-0.6733,  0.9438,  0.6919,  ..., -0.3840,  0.4272,  1.9414],
          [ 1.0498,  0.3989,  0.3013,  ..., -1.9463,  0.5469, -1.2148],
          [-0.2477, -0.8833, -1.9746,  ...,  0.5469,  1.9707,  0.5532],
          ...,
          [ 0.5244,  1.1689, -0.5732,  ..., -1.6299, -0.3330,  0.4656],
          [-0.7290,  0.9502,  0.3123,  ...,  0.2981,  1.2314,  1.5303],
          [-0.2659, -0.4954, -0.5200,  ...,  0.0591, -0.1464,  0.0805]]]],
       device='cuda:0', dtype=torch.float16)

Steps:  36%|███▌      | 72/200 [00:29<00:47,  2.70it/s, loss=0.206, lr=5e-6]
model_pred = tensor([[[
          [ 0.1188,  0.3740,  0.3113,  ...,  0.5493,  0.5493, -0.2810],
          [-0.7388,  0.4385, -0.3315,  ...,  0.3916,  1.1250, -0.2725],
          [ 0.5371,  0.3010,  0.1044,  ...,  0.5347, -0.6611, -0.0994],
          ...,
          [ 0.5615, -0.4299, -0.7407,  ..., -0.1791, -0.3521, -0.8721],
          [ 0.7563,  0.3669,  0.2042,  ..., -0.1681,  0.5630, -0.5942],
          [-1.0850, -0.8940, -0.1921,  ...,  0.0477, -0.0148, -0.0716]]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
target = tensor([[[
          [ 1.5879e+00,  4.8975e-01,  2.1936e-01,  ...,  1.2139e+00, 1.2402e+00, -6.9482e-01],
          [-8.4814e-01, -1.8896e+00,  9.5654e-01,  ...,  8.6084e-01, 1.4463e+00,  7.7588e-01],
          [ 7.5830e-01,  4.9292e-01,  8.8916e-01,  ...,  1.2035e-03, -1.6772e-01, -4.0894e-01],
          ...,
          [-2.0820e+00,  3.0591e-01, -3.4937e-01,  ...,  1.7346e-01, -1.1807e+00, -1.3477e+00],
          [ 8.9209e-01,  5.6580e-02,  1.1104e+00,  ...,  7.0508e-01, 1.1436e+00,  7.6477e-02],
          [-1.6006e+00,  3.5986e-01, -1.1299e+00,  ..., -2.1350e-01, -1.2188e+00, -9.2578e-01]]]],
       device='cuda:0', dtype=torch.float16)

Steps:  36%|███       | 73/200 [00:30<00:48,  2.61it/s, loss=0.206, lr=5e-6]
model_pred = tensor([[[
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
target = tensor([[[
          [-8.3936e-01, -3.0859e-01,  9.0234e-01,  ...,  1.5527e+00, 7.9150e-01, -1.1426e+00],
          [-4.8779e-01,  1.0518e+00,  7.3047e-01,  ...,  1.0029e+00, 5.6641e-01, -3.6133e-01],
          [ 3.8794e-01, -4.9121e-01,  8.2324e-01,  ..., -2.0352e+00, -7.2559e-01,  1.4229e+00],
          ...,
          [-1.0908e+00,  8.3545e-01, -2.5508e+00,  ...,  1.5889e+00, 1.3652e+00,  1.4922e+00],
          [ 6.2305e-01,  5.4248e-01,  9.0381e-01,  ...,  6.5332e-01, 6.5576e-01,  1.4917e-01],
          [-4.6509e-02, -8.7280e-02,  5.7129e-01,  ..., -2.3535e+00, -3.5791e-01, -7.7930e-01]]]],
       device='cuda:0', dtype=torch.float16)

Steps:  37%|███       | 74/200 [00:30<00:47,  2.64it/s, loss=0.206, lr=5e-6]
model_pred = tensor([[[
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
target = tensor([[[
          [ 0.8452,  0.5044,  0.4785,  ...,  0.8799, -0.6011, -0.3149],
          [-0.0324, -0.3601, -0.5815,  ..., -0.3540,  0.0344, -0.1820],
          [ 2.1719,  0.3870,  2.4609,  ...,  0.0185, -0.5635,  0.3682],
          ...,
          [-0.3491,  2.3477,  0.7144,  ...,  1.0020, -0.2527, -1.0068],
          [-0.1345, -0.4692,  0.5273,  ...,  0.5205, -0.1136,  0.1858],
          [-0.4377,  1.5996,  0.7803,  ...,  1.2422,  0.9648,  2.1523]]]],
       device='cuda:0', dtype=torch.float16)

System Info

IdealWaffle commented 1 year ago

It has something to do with the xFormers attention. Training works when I disable it, but uses more VRAM and takes longer as a result.
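
By "disable" I mean not turning the memory-efficient attention on for the UNet, something like this (a sketch; where exactly the script enables it depends on the version you are running, and enable_xformers here is just an illustrative flag):

from diffusers.utils.import_utils import is_xformers_available

enable_xformers = False  # illustrative flag, not an actual script argument
if enable_xformers and is_xformers_available():
    # Memory-efficient attention saves VRAM but seems connected to the nan issue here.
    unet.enable_xformers_memory_efficient_attention()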

David-Hari commented 1 year ago

Strangely, I did not get the nan error today. I ran the script twice (not resuming from a checkpoint) and it worked both times, so I'm not sure what's going on there.

I would still like to know how the save_weights function is supposed to work though.

David-Hari commented 1 year ago

Just to note, I do still get the nan errors sometimes, especially when running for longer. So if anyone else has encountered this problem and found a solution, I would really like to know.

In the meantime, I will continue to experiment.

David-Hari commented 1 year ago

I updated the diffusers library to the latest version. I had to install from source (the GitHub main branch) because the 0.11.1 release published on pip does not include some of the newer num_cycles and power arguments to the get_scheduler function. That seemed to solve the loss error, though the resulting images aren't that great, so I guess I'll have to play around with the parameters until I find something that works.
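
For reference, the call that needs the newer library looks like this (a sketch following the upstream train_dreambooth.py; the lr_num_cycles and lr_power argument names are the upstream script's, not necessarily this fork's):

# Installed with: pip install git+https://github.com/huggingface/diffusers
from diffusers.optimization import get_scheduler

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * args.gradient_accumulation_steps,
    num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
    num_cycles=args.lr_num_cycles,  # only accepted by newer diffusers releases
    power=args.lr_power,            # only accepted by newer diffusers releases
)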

David-Hari commented 1 year ago

Maybe the nan problem is happening because I generate sample images at points during training. I was able to run it for 8000 steps without nan loss when I omitted save_sample_prompt.

The question then is: why does saving samples affect the loss? Could it have something to do with torch.cuda.empty_cache()? I found that I would occasionally run out of memory unless I called it both before and after using StableDiffusionPipeline to generate the samples. Maybe some part of the model being trained gets affected by generating the samples.

By the way, this is the code I am using to generate samples (pretty much the same as the original code in save_weights):

torch.cuda.empty_cache()  # free cached memory before building the sampling pipeline
# Build a pipeline around the models currently being trained (unwrapped from accelerate).
pipeline = StableDiffusionPipeline.from_pretrained(
    args.pretrained_model_name_or_path,
    unet=accelerator.unwrap_model(unet),
    text_encoder=accelerator.unwrap_model(text_encoder),
    revision=args.revision,
)
pipeline = pipeline.to(accelerator.device)
# Fixed seed so samples are comparable between checkpoints.
g_cuda = torch.Generator(device=accelerator.device).manual_seed(args.seed) if args.seed is not None else None
pipeline.set_progress_bar_config(disable=True)
with torch.autocast('cuda'), torch.inference_mode():
    for i in tqdm(range(len(prompts_list)), desc='Generating samples'):
        sample_image = pipeline(
            args.save_sample_prompts,
            negative_prompt=args.save_sample_negative_prompt,
            guidance_scale=args.save_guidance_scale,
            num_inference_steps=args.save_infer_steps,
            generator=g_cuda
        ).images[0]
        sample_image.save(os.path.join(args.output_dir, f'{global_step}-{i}.png'))
# Drop the pipeline and free cached memory again before training resumes.
del pipeline
torch.cuda.empty_cache()
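
One thing I plan to try next is adding a couple of sanity checks right after this block, to narrow down whether the sampling itself is what pushes training into nan (a sketch, not something I have tested yet):

# Make sure the models are back in training mode after sampling,
# and check that nothing in the UNet has gone non-finite.
unet.train()
text_encoder.train()  # only relevant if the text encoder is being trained
for name, param in unet.named_parameters():
    if not torch.isfinite(param).all():
        print(f"Non-finite values in {name} right after sampling at step {global_step}")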