Open yunchangxiaoguan opened 4 weeks ago
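Same issue for me. Were you able to get it working? I have 2 A6000 cards.
ChatGPT to the rescue. With more than one GPU the text encoder is wrapped in DistributedDataParallel, so the real model has to be reached through .module. Change line 4427 in train_util.py: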
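# OLD (fails when text_encoder is wrapped in DistributedDataParallel)
# encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
# NEW (goes through .module; a plain, unwrapped model on a single GPU has no .module)
encoder_hidden_states = text_encoder.module.text_model.final_layer_norm(encoder_hidden_states)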
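The one-line change above hard-codes .module, which would presumably break again on a single-GPU run where the text encoder is not wrapped. A minimal sketch of a variant that handles both cases is below. The helper name unwrap_ddp and the Fake* stand-in classes are hypothetical and only there to make the example runnable without torch; they are not part of sd-scripts.

```python
def unwrap_ddp(model):
    """Return model.module if the object has one (DDP-style wrapper), else the model itself."""
    return getattr(model, "module", model)


# Stand-in classes (hypothetical) that mimic the attribute layout from the traceback.
class FakeTextModel:
    def final_layer_norm(self, hidden_states):
        # The real layer normalizes; the stand-in just passes the value through.
        return hidden_states


class FakeTextEncoder:
    def __init__(self):
        self.text_model = FakeTextModel()


class FakeDDP:
    """Mimics DistributedDataParallel: the wrapped model lives in .module."""

    def __init__(self, module):
        self.module = module


encoder = FakeTextEncoder()
wrapped = FakeDDP(encoder)

# Both the wrapped and the plain encoder resolve to the same underlying object.
plain_result = unwrap_ddp(encoder).text_model.final_layer_norm([1.0, 2.0])
wrapped_result = unwrap_ddp(wrapped).text_model.final_layer_norm([1.0, 2.0])
```

In train_util.py the patched line would then read something like unwrap_ddp(text_encoder).text_model.final_layer_norm(encoder_hidden_states). If an Accelerator object is in scope at that point, its unwrap_model method should serve the same purpose, though I have not checked how that fits into get_hidden_states.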
20:33:30-999499 INFO Start training Dreambooth...
20:33:31-005606 INFO Validating lr scheduler arguments...
20:33:31-008506 INFO Validating optimizer arguments...
20:33:31-011321 INFO Validating /home/gx/kohya_ss/dataset/logs existence and writability... SUCCESS
20:33:31-014568 INFO Validating /home/gx/kohya_ss/dataset/outputs existence and writability... SUCCESS
20:33:31-017163 INFO Validating /home/gx/stable-diffusion-webui/models/Stable-diffusion/majicmixRealistic_v7.safetensors existence... SUCCESS
20:33:31-019518 INFO Validating /home/gx/kohya_ss/dataset/images existence... SUCCESS
20:33:31-021425 INFO Headless mode, skipping verification if model already exist... if model already exist it will be overwritten...
20:33:31-023769 INFO Folder 100_ccpao: 100 repeats found
20:33:31-025703 INFO Folder 100_ccpao: 25 images found
20:33:31-026965 INFO Folder 100_ccpao: 25 * 100 = 2500 steps
20:33:31-028628 INFO Regulatization factor: 1
20:33:31-029955 INFO Total steps: 2500
20:33:31-031200 INFO Train batch size: 1
20:33:31-032432 INFO Gradient accumulation steps: 1
20:33:31-033651 INFO Epoch: 1
20:33:31-034837 INFO Max train steps: 1600
20:33:31-036077 INFO lr_warmup_steps = 160
20:33:31-039112 INFO Saving training config to
/home/gx/kohya_ss/dataset/outputs/last_20240607-203331.json...
20:33:31-041080 INFO Executing command: /home/gx/anaconda3/envs/ss/bin/accelerate launch
--dynamo_backend no --dynamo_mode default --gpu_ids 2,3,4,5
--mixed_precision fp16 --num_processes 1 --num_machines 1
--num_cpu_threads_per_process 2 /home/gx/kohya_ss/sd-scripts/train_db.py
--config_file
/home/gx/kohya_ss/dataset/outputs/config_dreambooth-20240607-203331.toml
20:33:31-044693 INFO Command executed.
The following values were not passed to accelerate launch and had defaults used instead:
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-06-07 20:33:39.487782: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-07 20:33:39.487866: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-07 20:33:39.489331: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-07 20:33:39.497473: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-07 20:33:40.611158: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-06-07 20:33:42 INFO Loading settings from /home/gx/kohya_ss/dataset/outputs/config_dreambooth-20240607-203331.toml... train_util.py:3744
INFO /home/gx/kohya_ss/dataset/outputs/config_dreambooth-20240607-203331 train_util.py:3763
2024-06-07 20:33:42 INFO prepare tokenizer train_util.py:4227
INFO update token length: 75 train_util.py:4244
2024-06-07 20:33:43 INFO prepare images. train_util.py:1572
INFO found directory /home/gx/kohya_ss/dataset/images/100_ccpao contains 25 image files train_util.py:1519
INFO 2500 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1621
INFO [Dataset 0] config_util.py:565 batch_size: 1
resolution: (512, 512)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 52142.02it/s]
INFO make buckets train_util.py:859
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます train_util.py:876
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:905
INFO bucket 0: resolution (512, 512), count: 2500 train_util.py:910
INFO mean ar error (without repeats): 0.0 train_util.py:915
INFO prepare accelerator train_db.py:106
WARNING Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. logging.py:61
accelerator device: cuda:0
INFO loading model for process 0/1 train_util.py:4385
INFO load StableDiffusion checkpoint: /home/gx/stable-diffusion-webui/models/Stable-diffusion/majicmixRealistic_v7.safetensors train_util.py:4341
INFO UNet2DConditionModel: 64, 8, 768, False, False original_unet.py:1387
2024-06-07 20:33:51 INFO loading u-net: model_util.py:1009
INFO loading vae: model_util.py:1017
2024-06-07 20:33:53 INFO loading text encoder: model_util.py:1074
2024-06-07 20:33:54 INFO Enable xformers for U-Net train_util.py:2660
INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 270949.87it/s]
INFO caching latents... train_util.py:1021
100%|████████████████████████████████████████████████████████████████| 25/25 [00:06<00:00, 4.08it/s]
prepare optimizer, data loader etc.
2024-06-07 20:34:01 INFO use 8-bit AdamW optimizer | {} train_util.py:3889
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 2500
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 2500
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
gradient ccumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1600
steps: 0%| | 0/1600 [00:00<?, ?it/s]
epoch 1/1
Traceback (most recent call last):
  File "/home/gx/kohya_ss/sd-scripts/train_db.py", line 529, in <module>
    train(args)
  File "/home/gx/kohya_ss/sd-scripts/train_db.py", line 343, in train
    encoder_hidden_states = train_util.get_hidden_states(
  File "/home/gx/kohya_ss/sd-scripts/library/train_util.py", line 4427, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps: 0%| | 0/1600 [00:00<?, ?it/s]
[2024-06-07 20:34:05,896] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 108617) of binary: /home/gx/anaconda3/envs/ss/bin/python
Traceback (most recent call last):
  File "/home/gx/anaconda3/envs/ss/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/gx/kohya_ss/sd-scripts/train_db.py FAILED
Failures: