DreamBooth LoRA: Can I get similar model performance in shorter time for a multi gpu setup (train_network.py)?

Hi, I'm trying to train a DreamBooth LoRA for SD1.5 faster on 4 GPUs than on 1 GPU. Would love any and all help!

I'm using code from the commit https://github.com/agarwalml/kohya_ss/commit/6c69b893e131dc428a21411f6212ee58a0819d30

With one GPU and around 1600 steps, I get a good LoRA with the parameters below (taken from the output json). However, this takes around 20 minutes, so I wanted to cut the time down as much as possible using multiple GPUs. I thought simply setting --num_processes 4 in the accelerate command would do the trick, but according to multiple sources online it doesn't work that way, and training takes about the same time.
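To make the "similar time" observation concrete, here is a rough sketch of the arithmetic as I understand it. It assumes that under accelerate each optimizer step consumes train_batch_size samples on every process, so the global batch is train_batch_size times num_processes. The helper names are my own and are not part of sd-scripts.

```python
def samples_seen(steps: int, per_device_batch: int, num_processes: int,
                 grad_accum: int = 1) -> int:
    """Total samples consumed, assuming each step processes one batch per process."""
    return steps * per_device_batch * num_processes * grad_accum


def equivalent_steps(target_samples: int, per_device_batch: int,
                     num_processes: int, grad_accum: int = 1) -> int:
    """Steps needed to consume target_samples at the given global batch size."""
    global_batch = per_device_batch * num_processes * grad_accum
    return -(-target_samples // global_batch)  # ceiling division


# Values from my config: max_train_steps = 1600, train_batch_size = 2
single_gpu_samples = samples_seen(1600, 2, 1)
four_gpu_same_steps = samples_seen(1600, 2, 4)
four_gpu_matched_steps = equivalent_steps(single_gpu_samples, 2, 4)

print(single_gpu_samples, four_gpu_same_steps, four_gpu_matched_steps)
```

If that assumption holds, keeping max_train_steps at 1600 on 4 GPUs processes four times as many samples in roughly the same wall-clock time, while matching the single-GPU sample count would need about a quarter of the steps. Whether quality survives the larger global batch without other changes (for example to the learning rate) is exactly what I'm unsure about.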

I saw a source ( https://www.pugetsystems.com/labs/hpc/multi-gpu-sd-training/ ) which claimed that setting max_train_epochs = 1 (in the toml) together with --num_processes 4 (4 GPUs) in the accelerate command would achieve the effect I wanted, but all this does is reduce the number of training steps from 1600 to 275. The resulting model is also noticeably worse: judging by the sample images, it doesn't reach the LoRA quality I get from the single-GPU run.

Is it possible to cut down LoRA training time while maintaining quality using multiple GPUs? If so, what settings should I change to make this possible?

Note: Sorry for the long post, I wanted to provide all the required context.

Parameters:

{
  "LoRA_type": "Standard",
  "LyCORIS_preset": "full",
  "adaptive_noise_scale": 0,
  "additional_parameters": "",
  "async_upload": false,
  "block_alphas": "",
  "block_dims": "",
  "block_lr_zero_threshold": "",
  "bucket_no_upscale": true,
  "bucket_reso_steps": 64,
  "bypass_mode": false,
  "cache_latents": false,
  "cache_latents_to_disk": false,
  "caption_dropout_every_n_epochs": 0,
  "caption_dropout_rate": 0.1,
  "caption_extension": ".txt",
  "clip_skip": 1,
  "color_aug": false,
  "constrain": 0,
  "conv_alpha": 1,
  "conv_block_alphas": "",
  "conv_block_dims": "",
  "conv_dim": 1,
  "dataset_config": "",
  "debiased_estimation_loss": false,
  "decompose_both": false,
  "dim_from_weights": false,
  "dora_wd": false,
  "down_lr_weight": "",
  "dynamo_backend": "no",
  "dynamo_mode": "default",
  "dynamo_use_dynamic": false,
  "dynamo_use_fullgraph": false,
  "enable_bucket": true,
  "epoch": 1,
  "extra_accelerate_launch_args": "",
  "factor": -1,
  "flip_aug": false,
  "fp8_base": false,
  "full_bf16": false,
  "full_fp16": false,
  "gpu_ids": "",
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": false,
  "huber_c": 0.1,
  "huber_schedule": "snr",
  "huggingface_path_in_repo": "",
  "huggingface_repo_id": "",
  "huggingface_repo_type": "",
  "huggingface_repo_visibility": "",
  "huggingface_token": "",
  "ip_noise_gamma": 0,
  "ip_noise_gamma_random_strength": false,
  "keep_tokens": 0,
  "learning_rate": 0.0001,
  "log_tracker_config": "",
  "log_tracker_name": "",
  "log_with": "",
  "logging_dir": "",
  "lora_network_weights": "",
  "loss_type": "l2",
  "lr_scheduler": "cosine",
  "lr_scheduler_args": "",
  "lr_scheduler_num_cycles": 1,
  "lr_scheduler_power": 1,
  "lr_warmup": 10,
  "main_process_port": 0,
  "masked_loss": false,
  "max_bucket_reso": 2048,
  "max_data_loader_n_workers": 0,
  "max_grad_norm": 1,
  "max_resolution": "512,512",
  "max_timestep": 1000,
  "max_token_length": 75,
  "max_train_epochs": 0,
  "max_train_steps": 1600,
  "mem_eff_attn": false,
  "metadata_author": "",
  "metadata_description": "",
  "metadata_license": "",
  "metadata_tags": "",
  "metadata_title": "",
  "mid_lr_weight": "",
  "min_bucket_reso": 256,
  "min_snr_gamma": 0,
  "min_timestep": 0,
  "mixed_precision": "fp16",
  "model_list": "custom",
  "module_dropout": 0,
  "multi_gpu": false,
  "multires_noise_discount": 0.3,
  "multires_noise_iterations": 0,
  "network_alpha": 48,
  "network_dim": 96,
  "network_dropout": 0,
  "noise_offset": 0.1,
  "noise_offset_random_strength": false,
  "noise_offset_type": "Original",
  "num_cpu_threads_per_process": 2,
  "num_machines": 1,
  "num_processes": 1,
  "optimizer": "AdamW8bit",
  "optimizer_args": "",
  "output_dir": "/home/ubuntu/work/kohya_ss/outputs",
  "output_name": "redactedname",
  "persistent_data_loader_workers": false,
  "pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
  "prior_loss_weight": 1,
  "random_crop": false,
  "rank_dropout": 0,
  "rank_dropout_scale": false,
  "reg_data_dir": "/home/ubuntu/work/redactedname_lora/reg",
  "rescaled": false,
  "resume": "",
  "resume_from_huggingface": "",
  "sample_every_n_epochs": 0,
  "sample_every_n_steps": 50,
  "sample_prompts": "redactedname, a photo of a man",
  "sample_sampler": "euler_a",
  "save_every_n_epochs": 1,
  "save_every_n_steps": 500,
  "save_last_n_steps": 0,
  "save_last_n_steps_state": 0,
  "save_model_as": "safetensors",
  "save_precision": "fp16",
  "save_state": false,
  "save_state_on_train_end": false,
  "save_state_to_huggingface": false,
  "scale_v_pred_loss_like_noise_pred": false,
  "scale_weight_norms": 0,
  "sdxl": false,
  "sdxl_cache_text_encoder_outputs": false,
  "sdxl_no_half_vae": false,
  "seed": 0,
  "shuffle_caption": true,
  "stop_text_encoder_training_pct": 0,
  "text_encoder_lr": 5e-05,
  "train_batch_size": 2,
  "train_data_dir": "/home/ubuntu/work/redactedname_lora/img",
  "train_norm": false,
  "train_on_input": true,
  "training_comment": "",
  "unet_lr": 0.0001,
  "unit": 1,
  "up_lr_weight": "",
  "use_cp": false,
  "use_scalar": false,
  "use_tucker": false,
  "v2": false,
  "v_parameterization": false,
  "v_pred_like_loss": 0,
  "vae": "",
  "vae_batch_size": 0,
  "wandb_api_key": "",
  "wandb_run_name": "",
  "weighted_captions": false,
  "xformers": "xformers"
}

Note: the img folder contains the folder "50_redactedname man" (50 repeats) with 22 captioned images, and the regularization folder contains the folder "1_man" with 1100 images.
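For what it's worth, the 275 steps I saw with max_train_epochs = 1 seem to follow from these folder sizes. The sketch below is my own back-of-the-envelope calculation, not code from sd-scripts. It assumes that regularization images are used up to the number of (repeated) training images and that bucketing does not change the batch count.

```python
import math


def steps_per_epoch(num_images: int, repeats: int,
                    num_reg_images: int, reg_repeats: int,
                    per_device_batch: int, num_processes: int) -> int:
    """Estimate optimizer steps in one epoch for a DreamBooth-style dataset."""
    train_samples = num_images * repeats
    # Assumption: reg samples are capped at the number of training samples.
    reg_samples = min(num_reg_images * reg_repeats, train_samples)
    total_samples = train_samples + reg_samples
    global_batch = per_device_batch * num_processes
    return math.ceil(total_samples / global_batch)


# "50_redactedname man": 22 images x 50 repeats; "1_man": 1100 images x 1 repeat
one_gpu = steps_per_epoch(22, 50, 1100, 1, per_device_batch=2, num_processes=1)
four_gpu = steps_per_epoch(22, 50, 1100, 1, per_device_batch=2, num_processes=4)

print(one_gpu, four_gpu)
```

Under those assumptions one epoch is 2200 samples, which works out to 1100 steps on 1 GPU and 275 steps on 4 GPUs, matching the step count reported above.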

When I run the standard single-GPU command below, I get a LoRA trained on my person (redactedname):

kohya_ss/venv/bin/accelerate launch --dynamo_backend "no" --dynamo_mode "default" --mixed_precision "fp16" --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 kohya_ss/sd-scripts/train_network.py --config_file redactedname.toml

Note: I haven't upgraded from the commit linked above, since it took some effort to get this running locally and I didn't want to break things unnecessarily.

Judging from the latest version of train_network.py, I don't see much difference, but I may be wrong. Thanks for all your help!

bmaltais / kohya_ss
