Training fails with an error at step 100

rain9726 commented 10 months ago

2024-01-09 08:43:20,825 - EasyPhoto - train_file_path : /root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py
2024-01-09 08:43:20,826 - EasyPhoto - cache_log_file_path: /root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-tmp/train_kohya_log.txt
Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (8)
The following values were not passed to accelerate launch and had defaults used instead:
    --num_processes was set to a value of 1
    --num_machines was set to a value of 1
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
2024-01-09 08:43:32,619 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2024-01-09 08:43:32,621 - modelscope - INFO - TensorFlow version 2.12.0 Found.
2024-01-09 08:43:32,621 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-01-09 08:43:32,650 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 ce52a1517bab79727e198f27c93177a5 and a total number of 943 components indexed
01/09/2024 08:43:33 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'dynamic_thresholding_ratio', 'timestep_spacing', 'sample_max_value', 'clip_sample_range', 'thresholding', 'variance_type', 'rescale_betas_zero_snr', 'prediction_type'} was not found in config. Values will be initialized to default values.
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
create LoRA network. base dim (rank): 128, alpha: 64
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder:
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
Resolving data files: 100%|████████████████████████████████| 31/31 [00:00<00:00, 67230.31it/s]
Downloading and preparing dataset imagefolder/default to /root/.cache/huggingface/datasets/imagefolder/default-5f1bdc0016d9699e/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f...
Downloading data files: 100%|████████████████████████████████| 16/16 [00:00<00:00, 65664.25it/s]
Downloading data files: 100%|████████████████████████████████| 15/15 [00:00<00:00, 68909.70it/s]
Extracting data files: 100%|████████████████████████████████| 15/15 [00:00<00:00, 4628.11it/s]
Dataset imagefolder downloaded and prepared to /root/.cache/huggingface/datasets/imagefolder/default-5f1bdc0016d9699e/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f. Subsequent calls will reuse this data.
100%|████████████████████████████████| 1/1 [00:00<00:00, 805.36it/s]
01/09/2024 08:43:46 - INFO - __main__ - Running training
01/09/2024 08:43:46 - INFO - __main__ - Num examples = 15
01/09/2024 08:43:46 - INFO - __main__ - Num Epochs = 200
01/09/2024 08:43:46 - INFO - __main__ - Instantaneous batch size per device = 1
01/09/2024 08:43:46 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
01/09/2024 08:43:46 - INFO - __main__ - Gradient Accumulation steps = 4
01/09/2024 08:43:46 - INFO - __main__ - Total optimization steps = 800
Steps: 0%| | 0/800 [00:00<?, ?it/s]
2024-01-09 08:43:46,805 - modelscope - INFO - Model revision not specified, use revision: v2.0.2
2024-01-09 08:43:48,800 - modelscope - INFO - initiate model from /root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface
2024-01-09 08:43:48,800 - modelscope - INFO - initiate model from location /root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface.
2024-01-09 08:43:48,802 - modelscope - WARNING - No preprocessor field found in cfg.
2024-01-09 08:43:48,802 - modelscope - WARNING - No val key and type key found in preprocessor domain of configuration.json file.
2024-01-09 08:43:48,802 - modelscope - WARNING - Cannot find available config to build preprocessor at mode inference, current config: {'model_dir': '/root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface'}. trying to build by task and model information.
2024-01-09 08:43:48,802 - modelscope - WARNING - Find task: face-detection, model type: None. Insufficient information to build preprocessor, skip building preprocessor
2024-01-09 08:43:48,803 - modelscope - INFO - loading model from /root/.cache/modelscope/hub/damo/cv_resnet50_face-detection_retinaface/pytorch_model.pt
/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=None.
  warnings.warn(msg)
2024-01-09 08:43:49,432 - modelscope - INFO - load model done
Steps: 12%|███████████████▏ | 100/800 [03:03<18:26, 1.58s/it, lr=5e-5, step_loss=0.00655]
saving checkpoint: /root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/user_weights/checkpoint-100.safetensors
01/09/2024 08:46:51 - INFO - __main__ - Saved state to /root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/user_weights/checkpoint-100.safetensors, /root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/user_weights/checkpoint-100
Steps: 12%|███████████████▍ | 100/800 [03:04<18:26, 1.58s/it, lr=5e-5, step_loss=0.148]
01/09/2024 08:46:51 - INFO - __main__ - Running validation...
Generating 4 images with prompt: easyphoto_face, easyphoto, 1person.
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_inpaint.StableDiffusionInpaintPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
You have loaded a UNet with 4 input channels which.
{'dynamic_thresholding_ratio', 'timestep_spacing', 'sample_max_value', 'thresholding', 'algorithm_type', 'solver_type', 'lower_order_final', 'variance_type', 'use_karras_sigmas', 'euler_at_final', 'use_lu_lambdas', 'prediction_type', 'solver_order', 'lambda_min_clipped'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py", line 1370, in <module>
    main()
  File "/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/utils/gpu_info.py", line 190, in wrapper
    result = func(*args, **kwargs)
  File "/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py", line 1237, in main
    log_validation(
  File "/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py", line 123, in log_validation
    image = pipeline(
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py", line 1349, in __call__
    noise_pred = self.unet(
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: UNet2DConditionModel.forward() got an unexpected keyword argument 'added_cond_kwargs'
Steps: 12%|███████████████▍ | 100/800 [03:17<23:01, 1.97s/it, lr=5e-5, step_loss=0.148]
Traceback (most recent call last):
  File "/root/miniconda3/envs/xl_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/xl_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 989, in <module>
    main()
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in main
    launch_command(args)
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/xl_env/bin/python', '/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py', '--pretrained_model_name_or_path=/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/stable-diffusion-v1-5', '--pretrained_model_ckpt=/root/autodl-tmp/stable-diffusion-webui/models/Stable-diffusion/Chilloutmix-Ni-pruned-fp16-fix.safetensors', '--train_data_dir=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/processed_images', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--dataloader_num_workers=16', '--max_train_steps=800', '--checkpointing_steps=100', '--learning_rate=0.0001', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--train_text_encoder', '--seed=782704', '--rank=128', '--network_alpha=64', '--validation_prompt=easyphoto_face, easyphoto, 1person', '--validation_steps=100', '--output_dir=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/user_weights', '--logging_dir=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/user_weights', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--template_dir=/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/training_templates', '--template_mask', '--merge_best_lora_based_face_id', '--merge_best_lora_name=22', '--cache_log_file=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-tmp/train_kohya_log.txt', '--validation']' returned non-zero exit status 1.
2024-01-09 08:47:05,766 - EasyPhoto - Error executing the command: Command '['/root/miniconda3/envs/xl_env/bin/python', '-m', 'accelerate.commands.launch', '--mixed_precision=fp16', '--main_process_port=3456', '/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py', '--pretrained_model_name_or_path=/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/stable-diffusion-v1-5', '--pretrained_model_ckpt=/root/autodl-tmp/stable-diffusion-webui/models/Stable-diffusion/Chilloutmix-Ni-pruned-fp16-fix.safetensors', '--train_data_dir=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/processed_images', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--dataloader_num_workers=16', '--max_train_steps=800', '--checkpointing_steps=100', '--learning_rate=0.0001', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--train_text_encoder', '--seed=782704', '--rank=128', '--network_alpha=64', '--validation_prompt=easyphoto_face, easyphoto, 1person', '--validation_steps=100', '--output_dir=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/user_weights', '--logging_dir=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-user-id-infos/22/user_weights', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--template_dir=/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/training_templates', '--template_mask', '--merge_best_lora_based_face_id', '--merge_best_lora_name=22', '--cache_log_file=/root/autodl-tmp/stable-diffusion-webui/outputs/easyphoto-tmp/train_kohya_log.txt', '--validation']' returned non-zero exit status 1.
Applying attention optimization: xformers... done.

Could the authors please take a look at what is causing this?

XMUykyz commented 10 months ago

I also encountered the same problem.

hkunzhe commented 10 months ago

@rain9726 and @XMUykyz, it looks like the error was caused by the validation step. The latest version adds exception handling for validation errors, so could you please either turn off validation in the training UI or upgrade EasyPhoto to the latest version?

  File "/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py", line 1237, in main
    log_validation(
  File "/root/autodl-tmp/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora.py", line 123, in log_validation
    image = pipeline(
  File "/root/miniconda3/envs/xl_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)

From the line number in the log, I can confirm you are not using the latest EasyPhoto. https://github.com/aigc-apps/sd-webui-EasyPhoto/commit/41b68d6e5f1b13b523ad9599ebc42048f7327a13#diff-801b9d852d2dfd58f3feee7db12fdc9554f8b73e4c581952d05408a62eb6a507L1237.
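For readers wondering what "exception handling for validation errors" means in practice, here is a minimal sketch of the general pattern, not the actual EasyPhoto code. The names `run_validation_safely` and `train` are hypothetical, and `log_validation` is only a stand-in that fails the same way the log above does. The idea is that a failing validation pass is logged and skipped, so training continues to the final step.

```python
import logging

logger = logging.getLogger("train_sketch")  # hypothetical logger name


def log_validation(step: int) -> list:
    """Stand-in for the real validation routine; it fails like the log above."""
    raise TypeError(
        "UNet2DConditionModel.forward() got an unexpected keyword argument "
        "'added_cond_kwargs'"
    )


def run_validation_safely(step: int) -> bool:
    """Run validation, but never let a validation failure abort training."""
    try:
        log_validation(step)
        return True
    except Exception as exc:
        logger.warning(
            "Validation failed at step %d, continuing training: %s", step, exc
        )
        return False


def train(max_steps: int, validation_steps: int) -> int:
    """Toy training loop; returns the number of steps that completed."""
    completed = 0
    for step in range(1, max_steps + 1):
        # ... the optimizer step would happen here ...
        completed = step
        if step % validation_steps == 0:
            run_validation_safely(step)
    return completed
```

With the values from the launch command above (`--max_train_steps=800`, `--validation_steps=100`), `train(800, 100)` runs all 800 steps even though every validation pass raises.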

wuziheng commented 10 months ago

Turn off validation in the training UI and restart.
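In the launch command logged above, validation appears as a bare `--validation` argument at the end of the list. As a rough sketch of what turning it off amounts to, assuming the UI toggle simply decides whether that flag is passed (an assumption, and `build_train_args` is a hypothetical helper, not an EasyPhoto function):

```python
from typing import List


def build_train_args(base_args: List[str], validation: bool) -> List[str]:
    """Return the argument list with '--validation' present only when enabled."""
    # Drop any existing flag first so the result is the same on repeated calls.
    args = [arg for arg in base_args if arg != "--validation"]
    if validation:
        args.append("--validation")
    return args


# A shortened version of the argument list from the log above.
logged_args = [
    "--max_train_steps=800",
    "--checkpointing_steps=100",
    "--validation_steps=100",
    "--validation",
]
args_without_validation = build_train_args(logged_args, validation=False)
```

Without the flag, the training script would never reach the validation call that raised the `TypeError`.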

rain9726 commented 10 months ago

Updating to the latest version fixed it, thanks.

aigc-apps / sd-webui-EasyPhoto

Training fails with an error at step 100 #365