bmaltais / kohya_ss

Incorrect calculation of “Max train steps”. #2965

Open iqddd opened 4 hours ago

iqddd commented 4 hours ago

When the number of epochs is set but 'Max train epoch' and 'Max train steps' are both 0 (meaning no override), the GUI calculates 'Max train steps' automatically with the following formula: Number of images * Number of epochs / Batch size. This formula ignores bucketing: when the number of images in a bucket is not a multiple of batch_size, that bucket's last batch is smaller, so an epoch takes more steps than the formula assumes. sd-scripts does take bucketing into account and computes the correct number of steps per epoch. However, because the GUI passes its own 'Max train steps' value, training stops early and the actual number of epochs is lower than the number specified in the GUI.

For example, this is the analysis the GUI prints before calling sd-scripts and setting "max_train_steps":

Folder 1_diamel_xl: 1 repeats found
Folder 1_diamel_xl: 63 images found
Folder 1_diamel_xl: 63 * 1 = 63 steps
Regularization factor: 1
Train batch size: 4
Gradient accumulation steps: 1
Epoch: 25
max_train_steps (63 / 4 / 1 * 25 * 1) = 394

Information about buckets from sd-scripts:

bucket 0: resolution (704, 1344), count: 2      // 1 step
bucket 1: resolution (768, 1280), count: 3      // 1 step
bucket 2: resolution (832, 1216), count: 22     // 6 steps
bucket 3: resolution (896, 1152), count: 17     // 5 steps
bucket 4: resolution (960, 1088), count: 5      // 2 steps
bucket 5: resolution (1024, 1024), count: 9     // 3 steps
bucket 6: resolution (1088, 960), count: 2      // 1 step
bucket 7: resolution (1152, 896), count: 2      // 1 step
bucket 8: resolution (1216, 832), count: 1      // 1 step

Summing these up gives 21 steps per epoch, which is confirmed by further output in the console:

  num train images * repeats / 学習画像の数×繰り返し回数: 63
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 21
  num epochs / epoch数: 19
  batch size per device / バッチサイズ: 4
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 394

The result is num epochs: 19 instead of the 25 set in the GUI (394 steps / 21 steps per epoch, rounded up).
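
To make the discrepancy easy to reproduce, here is a minimal sketch of both calculations using the bucket counts from the log above. The function names are hypothetical and are not taken from the kohya_ss or sd-scripts code. Rounding up is an assumption, chosen because it matches the 394 and 21 shown in the output.

```python
import math

# Bucket image counts copied from the sd-scripts output in this issue.
bucket_counts = [2, 3, 22, 17, 5, 9, 2, 2, 1]
batch_size = 4
grad_accum = 1
epochs = 25


def naive_max_train_steps(num_images, batch_size, grad_accum, epochs):
    """GUI-style estimate (hypothetical helper): one pool of images, no buckets."""
    return math.ceil(num_images / batch_size / grad_accum * epochs)


def bucketed_steps_per_epoch(bucket_counts, batch_size, grad_accum):
    """Each bucket is batched separately, so each bucket rounds up on its own."""
    batches = sum(math.ceil(count / batch_size) for count in bucket_counts)
    return math.ceil(batches / grad_accum)


num_images = sum(bucket_counts)
gui_steps = naive_max_train_steps(num_images, batch_size, grad_accum, epochs)
per_epoch = bucketed_steps_per_epoch(bucket_counts, batch_size, grad_accum)
actual_epochs = math.ceil(gui_steps / per_epoch)  # epochs that fit in the GUI's step budget
needed_steps = per_epoch * epochs                 # steps required for all requested epochs

print(gui_steps, per_epoch, actual_epochs, needed_steps)  # 394 21 19 525
```

The gap comes from the per-bucket rounding: 63 / 4 rounds up to 16 batches for a single pool, but the nine buckets produce 21 batches in total.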

iqddd commented 4 hours ago

As a workaround, set 'Max train epoch' to the same value as 'Epoch' and set 'Max train steps' to a very large number (for example, 999999). sd-scripts then overrides the step count, as its console output shows:

override steps. steps for 25 epochs is / 指定エポックまでのステップ数: 525
enable full fp16 training.
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 63
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 21
  num epochs / epoch数: 25
  batch size per device / バッチサイズ: 4
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 525
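
The two console logs in this thread are consistent with the following simplified model of how the step and epoch counts relate. This is a sketch inferred from the logged numbers only, not the actual sd-scripts implementation, and `effective_schedule` is a hypothetical name.

```python
import math


def effective_schedule(steps_per_epoch, max_train_steps, max_train_epochs=None):
    """Return (total_steps, num_epochs) in the way the logs above suggest."""
    if max_train_epochs:
        # Assumption: an epoch limit replaces whatever step count was passed in.
        total_steps = max_train_epochs * steps_per_epoch
    else:
        total_steps = max_train_steps
    num_epochs = math.ceil(total_steps / steps_per_epoch)
    return total_steps, num_epochs


default_run = effective_schedule(21, 394)             # GUI-computed max_train_steps
workaround_run = effective_schedule(21, 999999, 25)   # 'Max train epoch' = 25

print(default_run)     # (394, 19)
print(workaround_run)  # (525, 25)
```

Under this model the large 'Max train steps' value is never used, which is why any sufficiently large placeholder works.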