Training a model with SDXL on Alibaba Cloud fails with died with <Signals.SIGKILL: 9>

kbaicai commented 8 months ago

I get this error on both DSW and a standalone GPU cloud server. Is the resource usage too high?

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 989, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 985, in main
    launch_command(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora_sd_XL.py', '--pretrained_model_name_or_path=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/stable-diffusion-xl/stabilityai_stable_diffusion_xl_base_1.0', '--pretrained_model_ckpt=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/models/Stable-diffusion/SDXL_1.0_ArienMixXL_v2.0.safetensors', '--train_data_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-user-id-infos/cc_sdxl_9pic/processed_images', '--caption_column=text', '--resolution=1024', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--dataloader_num_workers=16', '--max_train_steps=600', '--checkpointing_steps=100', '--learning_rate=0.0001', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--train_text_encoder', '--seed=681878', '--rank=32', '--network_alpha=16', '--validation_prompt=easyphoto_face, easyphoto, 1person', '--validation_steps=100', '--output_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-user-id-infos/cc_sdxl_9pic/user_weights', '--logging_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-user-id-infos/cc_sdxl_9pic/user_weights', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--template_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/training_templates', '--template_mask', '--merge_best_lora_based_face_id', '--merge_best_lora_name=cc_sdxl_9pic', '--cache_log_file=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-tmp/train_kohya_log.txt', '--original_config=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml', '--pretrained_vae_model_name_or_path=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/stable-diffusion-xl/madebyollin_sdxl_vae_fp16_fix']' died with <Signals.SIGKILL: 9>.
Error executing the command: Command '['/usr/bin/python', '-m', 'accelerate.commands.launch', '--mixed_precision=fp16', '--main_process_port=3456', '/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/scripts/train_kohya/train_lora_sd_XL.py', '--pretrained_model_name_or_path=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/stable-diffusion-xl/stabilityai_stable_diffusion_xl_base_1.0', '--pretrained_model_ckpt=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/models/Stable-diffusion/SDXL_1.0_ArienMixXL_v2.0.safetensors', '--train_data_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-user-id-infos/cc_sdxl_9pic/processed_images', '--caption_column=text', '--resolution=1024', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--dataloader_num_workers=16', '--max_train_steps=600', '--checkpointing_steps=100', '--learning_rate=0.0001', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--train_text_encoder', '--seed=681878', '--rank=32', '--network_alpha=16', '--validation_prompt=easyphoto_face, easyphoto, 1person', '--validation_steps=100', '--output_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-user-id-infos/cc_sdxl_9pic/user_weights', '--logging_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-user-id-infos/cc_sdxl_9pic/user_weights', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--template_dir=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/training_templates', '--template_mask', '--merge_best_lora_based_face_id', '--merge_best_lora_name=cc_sdxl_9pic', '--cache_log_file=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/outputs/easyphoto-tmp/train_kohya_log.txt', '--original_config=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml', '--pretrained_vae_model_name_or_path=/mnt/workspace/demos/stable_diffusion_easyphoto_festival/stable-diffusion-webui/extensions/sd-webui-EasyPhoto/models/stable-diffusion-xl/madebyollin_sdxl_vae_fp16_fix']' returned non-zero exit status 1.
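
For reference, the "died with <Signals.SIGKILL: 9>" wording comes from Python's `subprocess.CalledProcessError`, which reports a negative return code as the signal that terminated the child process. The sketch below shows one way to decode such a return code. The helper name `describe_returncode` is made up for illustration, and the out-of-memory hint is an assumption about a common cause of SIGKILL, not a confirmed diagnosis for this issue.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        # On POSIX, a negative code -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"unknown signal {-returncode}"
        hint = ""
        if name == "SIGKILL":
            # Assumption: SIGKILL often (not always) comes from the kernel
            # OOM killer or from a container memory limit being exceeded.
            hint = " (possibly out of host memory)"
        return f"killed by {name}{hint}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Demo (POSIX only): start a child process that sends SIGKILL to itself.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    result = subprocess.run(child)
    print(describe_returncode(result.returncode))
```

In the log above, the first message describes the training script being killed by signal 9, while the second reports that the outer `accelerate.commands.launch` command exited with status 1 as a consequence.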

hkunzhe commented 8 months ago

@kbaicai, can you check out https://github.com/aigc-apps/sd-webui-EasyPhoto/pull/397? With batch_size=4 and gradient_accumulation_steps=1, SDXL training needs 16 GB of VRAM.
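
To put the two configurations side by side: the failing command used `--train_batch_size=1` with `--gradient_accumulation_steps=4`, while the setting mentioned here is batch_size=4 with gradient_accumulation_steps=1. Under the usual definition, both give the same effective batch size per optimizer step, but the second keeps four samples on the GPU at once and so needs more VRAM. The sketch below only does that arithmetic; `effective_batch_size` is an illustrative helper, not part of EasyPhoto or accelerate.

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_processes: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_processes


# Values from the failing command in this issue.
original = effective_batch_size(per_device_batch=1, grad_accum_steps=4)
# Values mentioned alongside the 16 GB VRAM figure.
suggested = effective_batch_size(per_device_batch=4, grad_accum_steps=1)
print(original, suggested)
```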

kbaicai commented 8 months ago

> @kbaicai, can you check out #397? With batch_size=4 and gradient_accumulation_steps=1, SDXL training needs 16 GB of VRAM.

I've tried this PR (https://github.com/kbaicai/sd-webui-EasyPhoto_testkilled), but it doesn't work on a T4 instance. My parameters are rank=16, network_alpha=8, num_workers=0, gradient_accumulation_steps=1, batch_size=4. It seems that the training process was killed by the cloud service provider (due to high CPU usage?).

zouxinyi0625 commented 8 months ago

Is the CPU memory usage close to 100% (shown in the top-right corner)? If so, please use an instance with more CPU memory.
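
If the console indicator is not available, host memory usage can also be read from inside the instance. The following is a minimal Linux-only sketch that parses `/proc/meminfo`; the helper names are illustrative, and the cgroup path is an assumption that only applies to containers using cgroup v2.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in KiB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields: dict) -> float:
    """Share of RAM in use, based on MemTotal and MemAvailable."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        fields = parse_meminfo(meminfo.read_text())
        print(f"host RAM in use: {used_percent(fields):.1f}%")
    # Assumption: inside a container the effective limit may be a cgroup
    # limit rather than MemTotal; cgroup v2 exposes it at this path.
    cgroup_limit = Path("/sys/fs/cgroup/memory.max")
    if cgroup_limit.exists():
        print("cgroup memory.max:", cgroup_limit.read_text().strip())
```

Running this shortly before and during training gives a rough idea of how close the instance is to its memory limit.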

hkunzhe commented 8 months ago

SD WebUI consumes a lot of memory, and the memory usage increases after multiple inferences, while SDXL training does not occupy a particularly large amount of memory.
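
One way to see which process is holding host memory is to sample the resident set size (RSS) of the WebUI process and of the training process at a few points in time. This is a minimal Linux-only sketch that reads `/proc/<pid>/status`; the function names are illustrative, and the PIDs to inspect have to be looked up separately (for example with `ps`).

```python
import os
from pathlib import Path
from typing import Optional


def rss_kib_from_status(status_text: str) -> Optional[int]:
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # some processes (e.g. kernel threads) have no VmRSS line


def rss_kib(pid: int) -> Optional[int]:
    """Current resident memory of a process, or None if it cannot be read."""
    try:
        return rss_kib_from_status(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        return None


if __name__ == "__main__":
    # Demo on the current process; pass the WebUI or training PID instead.
    print("RSS of this process (KiB):", rss_kib(os.getpid()))
```

Comparing the WebUI's RSS before and after several inference runs would show whether its memory use keeps growing, as described above.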

aigc-apps / sd-webui-EasyPhoto

Training a model with SDXL on Alibaba Cloud fails with died with <Signals.SIGKILL: 9> #401