CompVis / stable-diffusion

A latent text-to-image diffusion model
https://ommer-lab.com/research/latent-diffusion-models/

training custom dataset from scratch only outputs noises #492

Closed KyonP closed 1 year ago

KyonP commented 1 year ago

I am trying to train a custom dataset from the cartoon domain with text captions.

I tried out some other repos, such as the fine-tuning examples and optimizedSD, but I haven't been able to get usable results: they only produced two kinds of output, either partial characters from my dataset scattered all over the image, or brown, foggy noise images.

My config file is as follows:

model: #https://github.com/justinpinkney/stable-diffusion/blob/main/configs/stable-diffusion/pokemon.yaml
  # reference : https://github.com/CompVis/latent-diffusion/issues/132
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: txt
    image_size: 32 
    channels: 3 
    cond_stage_trainable: False
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 64
        in_channels: 3
        out_channels: 3
        model_channels: 224
        attention_resolutions:
          - 8
          - 4
          - 2
        num_res_blocks: 2
        channel_mult:
          - 1
          - 2
          - 3
          - 4
        num_head_channels: 32

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      ckpt_path: "models/first_stage_models/kl-f8/model.ckpt"
      params:
        embed_dim: 3 #4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 3
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    num_workers: 32
    train:
      target: ldm.data.cartoon_dataloader.CartoonDataset
      params:
        media_dir: data/cartoon_data
        media_size: 128
        split: "train"
    validation:
      target: ldm.data.cartoon_dataloader.CartoonDataset
      params:
        media_dir: data/cartoon_data
        media_size: 128
        split: "val"

lightning:
  find_unused_parameters: False

  modelcheckpoint:
    params:
      every_n_train_steps: 20000
      save_top_k: -1
      monitor: null

  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 2000
        max_images: 4
        increase_log_steps: False
        log_first_step: True
        log_all_val: True
        log_images_kwargs:
          use_ema_scope: True
          inpaint: False
          plot_progressive_rows: False
          plot_diffusion_rows: False
          N: 4
          unconditional_guidance_scale: 3.0
          unconditional_guidance_label: [""]

  trainer:
    benchmark: True
    num_sanity_val_steps: 0
    limit_train_batches: 0.01 # currently only use partial
    limit_val_batches: 0.001 # 0.05
    limit_test_batches: 1.0 # 0.05
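
For reference, CartoonDataset is my own loader for this dataset. Below is a minimal sketch of a compatible dataset class; the parameter names (media_dir, media_size, split) come from the config above, but the file layout and caption format are illustrative rather than my exact implementation. The important part is that each sample exposes the image under first_stage_key ("image") as a float32 HWC array in [-1, 1] and the caption under cond_stage_key ("txt") as a plain string.

import json
import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class CartoonDataset(Dataset):
    # Illustrative loader: assumes a <split>.json file under media_dir listing
    # {"file_name": ..., "caption": ...} entries. The real layout may differ.
    def __init__(self, media_dir, media_size, split="train"):
        self.media_dir = media_dir
        self.media_size = media_size
        with open(os.path.join(media_dir, f"{split}.json")) as f:
            self.entries = json.load(f)

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, i):
        entry = self.entries[i]
        path = os.path.join(self.media_dir, entry["file_name"])
        image = Image.open(path).convert("RGB")
        image = image.resize((self.media_size, self.media_size), Image.BICUBIC)
        # LatentDiffusion.get_input expects HWC float32 scaled to [-1, 1]
        image = np.array(image).astype(np.float32) / 127.5 - 1.0
        return {"image": image, "txt": entry["caption"]}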

After several rounds of hyperparameter tuning (lowering the learning rate, reducing the UNet dimensions, etc.), the validation still outputs this kind of image every time: [image]

I am not sure where I made a mistake. Any suggestions would be greatly appreciated.

tenghui98 commented 1 year ago

Just like you, I also encountered this problem.

tenghui98 commented 1 year ago

https://github.com/LambdaLabsML/examples/issues/33

KyonP commented 1 year ago

Just like you, I also encountered this problem.

I am not sure I made the right move; however, I managed to get "looks okay" images.

In my case, I changed the size of the input images to the same size as the LambdaLabs Pokémon example (512×512).
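
Concretely, the data section looked something like this after the change (media_size bumped to 512; I am omitting any other tweaks). Note that the model-side image_size should then match the latent resolution, i.e. the input size divided by the first-stage autoencoder's downsampling factor:

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    num_workers: 32
    train:
      target: ldm.data.cartoon_dataloader.CartoonDataset
      params:
        media_dir: data/cartoon_data
        media_size: 512   # was 128
        split: "train"
    validation:
      target: ldm.data.cartoon_dataloader.CartoonDataset
      params:
        media_dir: data/cartoon_data
        media_size: 512   # was 128
        split: "val"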

It worked "okay," but the reason I keep using quotes is that I only fed in a small portion of the input images for fast development.

As soon as I trained on the full dataset, it collapsed again, not into noise images this time but into cloudy ones (maybe because the dataset is low resolution and had to be upscaled to 512).

It seems that hyperparameter search and tuning are essential.

I hope this small write-up helps.

LilyDaytoy commented 1 year ago

LambdaLabsML/examples#33

Hi, may I ask how you found other repos that show fine-tuning a text-to-image diffusion model on custom datasets? I want to fine-tune a text-to-image diffusion model on my own dataset using this latent-diffusion repo, but I do not know how to create the dataset file, the config, or the training script. Do you know if there are any instructions or guidance about this? Thanks a lot!

yangyuke001 commented 1 year ago

@LilyDaytoy Hi, do you have a solution now?