LambdaLabsML / examples

Deep Learning Examples
MIT License

Has anyone been able to train successfully? #30

Open soon-yau opened 1 year ago

soon-yau commented 1 year ago

I wonder if anyone has successfully fine-tuned the model? I am having difficulty training with this code, whether fine-tuning from a Stable Diffusion checkpoint or training from scratch.

I noticed that after around 2000 steps, the loss spiked from 0.1 to 1 and then stayed there. Inference from checkpoints shows that while the loss was 0.1, the generated images were "ok": when fine-tuning, the model could still generate recognizable shapes from my custom dataset, although they were not pretty; when training from scratch, I could see blobs of noise. However, once the loss was stuck at 1.0, all it could generate was black images.

I tried different random seeds, but it didn't help.
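
One way to pin down where a run like this goes wrong is to scan the logged loss for the first step at which it jumps well above its recent level, then run inference from the last checkpoint saved before that step. The sketch below is a minimal, hypothetical helper: the function name, window size, and threshold factor are assumptions and not part of this repo. It works on any list of per-step loss values.

```python
from statistics import median

def first_spike_step(losses, window=50, factor=3.0):
    """Return the index of the first loss exceeding `factor` times the
    median of the preceding `window` values, or None if there is no spike."""
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        if losses[step] > factor * baseline:
            return step
    return None

# Synthetic example mimicking the behaviour described above:
# loss near 0.1 for 2000 steps, then stuck near 1.0.
losses = [0.1] * 2000 + [1.0] * 500
spike = first_spike_step(losses)
print(spike)  # 2000
```

Using the median rather than the mean keeps a single noisy step in the window from shifting the baseline.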

gstswwx commented 1 year ago

Try a smaller learning rate.

KyonP commented 1 year ago

I am also struggling to fine-tune with a custom dataset in the cartoon domain.

Following the suggestions, I have tried a smaller batch size (1, with gradient accumulation over 4 batches) and smaller learning rates, but the outputs look like partially drawn characters scattered all over the image.

Still, I couldn't achieve successful results.

If anyone has other advice, I would be very grateful.

soon-yau commented 1 year ago

I have tried a smaller learning rate, but I think the instability comes from the small batch size of 1, which is limited by my GPU memory.

I then switched to the official Stable Diffusion repo (basically an LDM), reduced the UNet channel dimension, set batch size=16, and trained the model from scratch, and then it worked.
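
Batch size and learning rate interact here. The CompVis-style main.py appears to multiply base_learning_rate by the batch size, the number of GPUs, and accumulate_grad_batches unless --scale_lr False is passed; treat that rule as an assumption and check it against your copy of main.py. A small sketch of the arithmetic (the function name is hypothetical):

```python
def effective_settings(base_lr, batch_size, n_gpus=1, accumulate=1, scale_lr=True):
    """Return (effective batch size, learning rate actually used).

    Assumes the scaling rule lr = accumulate * n_gpus * batch_size * base_lr
    when scale_lr is enabled; verify this against your main.py.
    """
    effective_batch = batch_size * n_gpus * accumulate
    lr = base_lr * effective_batch if scale_lr else base_lr
    return effective_batch, lr

# Batch size 1 with 4 accumulation steps, as tried earlier in the thread.
print(effective_settings(1.0e-4, batch_size=1, accumulate=4))
# Batch size 16 on one GPU, with and without learning-rate scaling.
print(effective_settings(1.0e-4, batch_size=16))
print(effective_settings(1.0e-4, batch_size=16, scale_lr=False))
```

Under this rule, moving from batch size 1 to 16 with scaling left on also raises the learning rate 16-fold, so the two changes are worth testing separately.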

KyonP commented 1 year ago

I'll try this, thanks.

BTW, is this repo (the Lambda Labs Pokémon example) significantly different from the official Stable Diffusion?

I am just asking out of curiosity, since I haven't dug into this repo enough.

KyonP commented 1 year ago

I only get these noisy images.

image

I am not sure where I made a mistake. The following are my config settings:

model: #https://github.com/justinpinkney/stable-diffusion/blob/main/configs/stable-diffusion/pokemon.yaml
  # reference : https://github.com/CompVis/latent-diffusion/issues/132
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: txt
    image_size: 32 
    channels: 3 
    cond_stage_trainable: False
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 64
        in_channels: 3
        out_channels: 3
        model_channels: 224
        attention_resolutions:
          - 8
          - 4
          - 2
        num_res_blocks: 2
        channel_mult:
          - 1
          - 2
          - 3
          - 4
        num_head_channels: 32

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      ckpt_path: "models/first_stage_models/kl-f8/model.ckpt"
      params:
        embed_dim: 3 #4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 3
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    num_workers: 32
    train:
      target: ldm.data.cartoon_dataloader.CartoonDataset
      params:
        media_dir: data/cartoon_data
        media_size: 128
        split: "train"
    validation:
      target: ldm.data.cartoon_dataloader.CartoonDataset
      params:
        media_dir: data/cartoon_data
        media_size: 128
        split: "val"

lightning:
  find_unused_parameters: False

  modelcheckpoint:
    params:
      every_n_train_steps: 20000
      save_top_k: -1
      monitor: null

  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 2000
        max_images: 4
        increase_log_steps: False
        log_first_step: True
        log_all_val: True
        log_images_kwargs:
          use_ema_scope: True
          inpaint: False
          plot_progressive_rows: False
          plot_diffusion_rows: False
          N: 4
          unconditional_guidance_scale: 3.0
          unconditional_guidance_label: [""]

  trainer:
    benchmark: True
    num_sanity_val_steps: 0
    limit_train_batches: 0.01 # currently only use partial
    limit_val_batches: 0.001 # 0.05
    limit_test_batches: 1.0 # 0.05

Any suggestions would be greatly appreciated.
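
One thing worth checking in a config like this is that the latent shape the UNet expects matches what the first-stage autoencoder produces. A minimal sketch, assuming the AutoencoderKL encoder halves the resolution once per entry in ch_mult after the first, so that the downsampling factor is 2 ** (len(ch_mult) - 1). The helper name is hypothetical; the values are copied from the config above.

```python
def latent_shape(media_size, ch_mult, z_channels):
    """Latent (channels, height, width) for a square input image,
    assuming one 2x downsample per ch_mult entry after the first."""
    factor = 2 ** (len(ch_mult) - 1)
    if media_size % factor:
        raise ValueError(f"media_size {media_size} not divisible by {factor}")
    side = media_size // factor
    return (z_channels, side, side)

# Values from the config above: media_size, ddconfig.ch_mult, ddconfig.z_channels.
shape = latent_shape(media_size=128, ch_mult=[1, 2, 4], z_channels=3)
print(shape)  # (3, 32, 32)

# Compare with model.params.channels / image_size and unet_config.in_channels.
expected = (3, 32, 32)
print("latent matches model.params:", shape == expected)
```

Under that assumption, a ch_mult of [1, 2, 4] gives a factor of 4, while the ckpt_path in first_stage_config names kl-f8, which suggests a factor of 8. Whether those weights actually load into this architecture is worth verifying.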

Salv1a commented 1 year ago

@KyonP Maybe you forgot to add '--' before 'finetune_from ...' in the command.

nocol0101001 commented 1 year ago

!(python main.py -t --base configs/stable-diffusion/pokemon.yaml --gpus "0," --scale_lr False --num_nodes 1 --check_val_every_n_epoch 10 --finetune_from "$ckpt_path" data.params.batch_size="2" lightning.trainer.accumulate_grad_batches="1" data.params.validation.params.n_gpus="$NUM_GPUS" )

Hello, training like this still gives me nothing but noise. What could be the cause? Thanks.

Salv1a commented 1 year ago

Try running txt2img.py with the checkpoint in your $ckpt_path. If you still get noisy images, your checkpoint may be incorrect.

nocol0101001 commented 1 year ago

Hello, this is my training and testing code, and last.ckpt is the checkpoint I saved after fine-tuning. The checkpoint is about 13 GB in size. Is there anything wrong? Thanks.

nocol0101001 commented 1 year ago

image image

Salv1a commented 1 year ago

Still noisy pictures? It looks all right. Check the fine-tuning logdir: did you get correct pictures there? You can send me more information by email -> wangsheng_0205@163.com.

lvsi-qi commented 1 year ago

Hello, I would like to ask how long an epoch is and how often a checkpoint is saved?

JaosonMa commented 1 year ago

https://github.com/LambdaLabsML/examples/blob/main/stable-diffusion-finetuning/pokemon_finetune.ipynb

I use this to retrain on Pokémon:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="./pokemon"
export HUB_MODEL_ID="pokemon-lora"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export TRAIN_DIR="./images/train"
export CACHE_DIR="./text_to_image/cache_dir"
export CUDA_VISIBLE_DEVICES=1

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=${TRAIN_DIR} \
  --cache_dir=${CACHE_DIR} \
  --dataloader_num_workers=8 \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --checkpointing_steps=1000 \
  --report_to=wandb \
  --resume_from_checkpoint="latest" \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337

But the loss keeps jumping, oscillating around 0.5. Why? I also tried batch_size=4, with the same result.

image
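
A per-step loss that bounces around is expected to some degree in diffusion training, because each step draws random timesteps and noise, and with a batch size of 1 to 4 the average is taken over very few samples. Judging progress from a smoothed curve and from the validation images is usually more informative than the raw value. A minimal sketch of exponential smoothing over logged losses (the function name and the 0.98 factor are illustrative assumptions, and the input data is synthetic):

```python
def ema_smooth(values, beta=0.98):
    """Exponentially weighted moving average with bias correction."""
    smoothed, avg = [], 0.0
    for i, v in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * v
        smoothed.append(avg / (1.0 - beta ** i))  # correct the zero-init bias
    return smoothed

# Synthetic noisy losses alternating between 0.1 and 0.9.
raw = [0.9 if i % 2 else 0.1 for i in range(1000)]
smooth = ema_smooth(raw)
print(round(smooth[-1], 2))
```

If the smoothed curve is flat or rising over thousands of steps, that is a stronger sign of a real problem than large swings in the raw per-step value.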