yunchangxiaoguan commented 4 weeks ago

20:33:30-999499 INFO Start training Dreambooth...
20:33:31-005606 INFO Validating lr scheduler arguments...
20:33:31-008506 INFO Validating optimizer arguments...
20:33:31-011321 INFO Validating /home/gx/kohya_ss/dataset/logs existence and writability... SUCCESS
20:33:31-014568 INFO Validating /home/gx/kohya_ss/dataset/outputs existence and writability... SUCCESS
20:33:31-017163 INFO Validating /home/gx/stable-diffusion-webui/models/Stable-diffusion/majicmixRealistic_v7.safetensors existence... SUCCESS
20:33:31-019518 INFO Validating /home/gx/kohya_ss/dataset/images existence... SUCCESS
20:33:31-021425 INFO Headless mode, skipping verification if model already exist... if model already exist it will be overwritten...
20:33:31-023769 INFO Folder 100_ccpao: 100 repeats found
20:33:31-025703 INFO Folder 100_ccpao: 25 images found
20:33:31-026965 INFO Folder 100_ccpao: 25 * 100 = 2500 steps
20:33:31-028628 INFO Regulatization factor: 1
20:33:31-029955 INFO Total steps: 2500
20:33:31-031200 INFO Train batch size: 1
20:33:31-032432 INFO Gradient accumulation steps: 1
20:33:31-033651 INFO Epoch: 1
20:33:31-034837 INFO Max train steps: 1600
20:33:31-036077 INFO lr_warmup_steps = 160
20:33:31-039112 INFO Saving training config to
/home/gx/kohya_ss/dataset/outputs/last_20240607-203331.json...
20:33:31-041080 INFO Executing command: /home/gx/anaconda3/envs/ss/bin/accelerate launch
--dynamo_backend no --dynamo_mode default --gpu_ids 2,3,4,5
--mixed_precision fp16 --num_processes 1 --num_machines 1
--num_cpu_threads_per_process 2 /home/gx/kohya_ss/sd-scripts/train_db.py
--config_file
/home/gx/kohya_ss/dataset/outputs/config_dreambooth-20240607-203331.toml
20:33:31-044693 INFO Command executed.
The following values were not passed to accelerate launch and had defaults used instead:
    More than one GPU was found, enabling multi-GPU training.
    If this was unintended please pass in --num_processes=1.
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-06-07 20:33:39.487782: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-07 20:33:39.487866: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-07 20:33:39.489331: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-07 20:33:39.497473: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-07 20:33:40.611158: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-06-07 20:33:42 INFO Loading settings from /home/gx/kohya_ss/dataset/outputs/config_dreambooth-20240607-203331.toml... train_util.py:3744
INFO /home/gx/kohya_ss/dataset/outputs/config_dreambooth-20240607-203331 train_util.py:3763
2024-06-07 20:33:42 INFO prepare tokenizer train_util.py:4227
INFO update token length: 75 train_util.py:4244
2024-06-07 20:33:43 INFO prepare images. train_util.py:1572
INFO found directory /home/gx/kohya_ss/dataset/images/100_ccpao contains 25 image files train_util.py:1519
INFO 2500 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1621
INFO [Dataset 0] config_util.py:565
batch_size: 1
resolution: (512, 512)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True

                           [Subset 0 of Dataset 0]                                               
                             image_dir:                                                          
                         "/home/gx/kohya_ss/dataset/images/100_ccpao"                            
                             image_count: 25                                                     
                             num_repeats: 100                                                    
                             shuffle_caption: False                                              
                             keep_tokens: 0                                                      
                             keep_tokens_separator:                                              
                             secondary_separator: None                                           
                             enable_wildcard: False                                              
                             caption_dropout_rate: 0.0                                           
                             caption_dropout_every_n_epoches: 0                                  
                             caption_tag_dropout_rate: 0.0                                       
                             caption_prefix: None                                                
                             caption_suffix: None                                                
                             color_aug: False                                                    
                             flip_aug: False                                                     
                             face_crop_aug_range: None                                           
                             random_crop: False                                                  
                             token_warmup_min: 1,                                                
                             token_warmup_step: 0,                                               
                             is_reg: False                                                       
                             class_tokens: ccpao                                                 
                             caption_extension: .txt                                             

                INFO     [Dataset 0]                                           config_util.py:571
                INFO     loading image sizes.                                   train_util.py:853

100%|████████████████████| 25/25 [00:00<00:00, 52142.02it/s]
INFO make buckets train_util.py:859
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます train_util.py:876
INFO number of images (including repeats) / 各bucketの画像枚数（繰り返し回数を含む） train_util.py:905
INFO bucket 0: resolution (512, 512), count: 2500 train_util.py:910
INFO mean ar error (without repeats): 0.0 train_util.py:915
INFO prepare accelerator train_db.py:106
WARNING Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. logging.py:61
accelerator device: cuda:0
INFO loading model for process 0/1 train_util.py:4385
INFO load StableDiffusion checkpoint: /home/gx/stable-diffusion-webui/models/Stable-diffusion/majicmixRealistic_v7.safetensors train_util.py:4341
INFO UNet2DConditionModel: 64, 8, 768, False, False original_unet.py:1387
2024-06-07 20:33:51 INFO loading u-net: model_util.py:1009
INFO loading vae: model_util.py:1017
2024-06-07 20:33:53 INFO loading text encoder: model_util.py:1074
2024-06-07 20:33:54 INFO Enable xformers for U-Net train_util.py:2660
INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|████████████████████| 25/25 [00:00<00:00, 270949.87it/s]
INFO caching latents... train_util.py:1021
100%|████████████████████| 25/25 [00:06<00:00, 4.08it/s]
prepare optimizer, data loader etc.
2024-06-07 20:34:01 INFO use 8-bit AdamW optimizer | {} train_util.py:3889
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 2500
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 2500
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: 1
gradient ccumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1600
steps: 0%| | 0/1600 [00:00<?, ?it/s]
epoch 1/1
Traceback (most recent call last):
  File "/home/gx/kohya_ss/sd-scripts/train_db.py", line 529, in <module>
    train(args)
  File "/home/gx/kohya_ss/sd-scripts/train_db.py", line 343, in train
    encoder_hidden_states = train_util.get_hidden_states(
  File "/home/gx/kohya_ss/sd-scripts/library/train_util.py", line 4427, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps: 0%| | 0/1600 [00:00<?, ?it/s]
[2024-06-07 20:34:05,896] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 108617) of binary: /home/gx/anaconda3/envs/ss/bin/python
Traceback (most recent call last):
  File "/home/gx/anaconda3/envs/ss/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gx/anaconda3/envs/ss/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/gx/kohya_ss/sd-scripts/train_db.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-06-07_20:34:05
  host       : 6c0f6c68e59b
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 108617)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
20:34:07-301036 INFO Training has ended.

How can I solve this problem? Thanks.

riffmaster-2001 commented 4 days ago

Same issue for me. Were you able to get it working? I have 2 A6000 cards.

riffmaster-2001 commented 4 days ago

ChatGPT to the rescue. Change line 4427 in train_util.py so the call goes through the DistributedDataParallel wrapper's .module attribute:

        # OLD
        # encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
        # NEW
        encoder_hidden_states = text_encoder.module.text_model.final_layer_norm(encoder_hidden_states)
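
One caveat with hard-coding `.module`: on a single-GPU run the text encoder is presumably not wrapped, so `text_encoder.module` would then fail the other way round. A minimal sketch of a guard that handles both cases is below. The helper name `unwrap_model` and the `Fake*` classes are hypothetical stand-ins so the snippet runs without torch or multiple GPUs; they are not part of sd-scripts. In a real script, `Accelerator.unwrap_model` from accelerate may be the more robust choice.

```python
def unwrap_model(model):
    """Return the underlying model if it is wrapped (e.g. by DistributedDataParallel).

    Assumption: the wrapper exposes the original model as `.module`, as DDP does.
    A model that legitimately owns a submodule called `module` would be
    unwrapped by mistake, so treat this as a sketch rather than a general tool.
    """
    while hasattr(model, "module"):
        model = model.module
    return model


# Hypothetical stand-ins for the real objects, only to make the sketch runnable.
class FakeTextModel:
    def final_layer_norm(self, hidden_states):
        return hidden_states  # the real layer would normalize; identity is enough here


class FakeTextEncoder:
    def __init__(self):
        self.text_model = FakeTextModel()


class FakeDDP:
    """Mimics the wrapper: exposes `.module` but not `.text_model`."""

    def __init__(self, module):
        self.module = module


plain = FakeTextEncoder()   # single-GPU case: no wrapper
wrapped = FakeDDP(plain)    # multi-GPU case: wrapped encoder

for encoder in (plain, wrapped):
    # The same line works whether or not the encoder is wrapped.
    hidden = unwrap_model(encoder).text_model.final_layer_norm([1.0, 2.0])
    print(hidden)
```

Applied to line 4427, the idea would be to call `unwrap_model(text_encoder).text_model.final_layer_norm(...)` instead of accessing `.module` unconditionally, assuming the rest of `get_hidden_states` does not depend on the wrapper.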

bmaltais / kohya_ss

AttributeError: 'DistributedDataParallel' object has no attribute 'text_model' #2572
