bmaltais / kohya_ss

returned non-zero exit status 3221225477. #2684

Open K1LL3RPUNCH opened 1 month ago

K1LL3RPUNCH commented 1 month ago

Loading pipeline components...: 100%|████████████████████████████████████████████████████| 5/5 [00:00<00:00, 16.55it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
INFO     UNet2DConditionModel: 64, 8, 768, False, False    original_unet.py:1387
Traceback (most recent call last):
  File "C:\Users\inkvi\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\inkvi\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\StableDiffusion\stable-diffusion-webui\extensions\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "D:\StableDiffusion\stable-diffusion-webui\extensions\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "D:\StableDiffusion\stable-diffusion-webui\extensions\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "D:\StableDiffusion\stable-diffusion-webui\extensions\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\StableDiffusion\stable-diffusion-webui\extensions\kohya_ss\venv\Scripts\python.exe', 'D:/StableDiffusion/stable-diffusion-webui/extensions/kohya_ss/sd-scripts/train_network.py', '--config_file', 'D:/StableDiffusion/trained/config_lora-20240802-201633.toml']' returned non-zero exit status 3221225477.
20:16:50-267285 INFO     Training has ended.
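For context when reading the traceback: exit status 3221225477 is the unsigned 32-bit form of the Windows NTSTATUS code 0xC0000005 (STATUS_ACCESS_VIOLATION), so the launched python.exe process died in native code and accelerate's simple_launcher only reports the crash; it is not a Python-level exception raised by train_network.py. A minimal sketch (not from the logs above) showing the conversion:

```python
# Minimal sketch: decode the exit status reported by accelerate's simple_launcher.
# 3221225477 is the unsigned 32-bit form of NTSTATUS 0xC0000005
# (STATUS_ACCESS_VIOLATION), i.e. a native crash in the child process.
exit_status = 3221225477

print(hex(exit_status))            # 0xc0000005
print(hex(exit_status - 2**32))    # -0x3ffffffb, the signed value some tools show
```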

K1LL3RPUNCH commented 1 month ago

20:20:44-548897 INFO
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
epoch = 40
gradient_accumulation_steps = 1
huber_c = 0.1
huber_schedule = "snr"
learning_rate = 0.0001
loss_type = "l2"
lr_scheduler = "cosine"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 200
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 2000
min_bucket_reso = 256
mixed_precision = "no"
multires_noise_discount = 0.3
network_alpha = 1
network_args = []
network_dim = 8
network_module = "networks.lora"
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "Adafactor"
output_dir = "D:/StableDiffusion/trained"
output_name = "spikeschaffer"
pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
prior_loss_weight = 1
resolution = "768"
sample_prompts = "D:/StableDiffusion/trained\prompt.txt"
sample_sampler = "k_dpm_2_a"
save_every_n_epochs = 3
save_model_as = "safetensors"
save_precision = "fp16"
text_encoder_lr = 0.0001
train_batch_size = 2
train_data_dir = "D:/Spike Schaffer"
unet_lr = 0.0001
xformers = true

20:20:44-552897 INFO end of toml config file: D:/StableDiffusion/trained/config_lora-20240802-202044.toml
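To double-check that the values above are really what accelerate hands to train_network.py via --config_file, the generated TOML can be read back before launching. A minimal sketch, assuming the config path from the log; tomllib needs Python 3.11+, so on the 3.10 interpreter shown in the traceback one would `pip install tomli` and `import tomli as tomllib` instead:

```python
# Minimal sketch: re-read the TOML that accelerate passes to train_network.py,
# to confirm what the GUI actually wrote.
import tomllib  # Python 3.11+; on 3.10 use: import tomli as tomllib

with open("D:/StableDiffusion/trained/config_lora-20240802-202044.toml", "rb") as f:
    cfg = tomllib.load(f)

for key in ("pretrained_model_name_or_path", "resolution",
            "mixed_precision", "optimizer_type", "train_data_dir"):
    print(f"{key} = {cfg.get(key)!r}")
```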

trick72 commented 1 month ago

Hi, I get the exact same issue. The traceback happens during training, after some epochs have finished, and always at a different point in the run: sometimes soon after starting, sometimes after an hour. There is no consistency to when it happens. For me it started after I upgraded to the latest NVIDIA driver.
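Since the crashes began after a driver upgrade, it may help to record exactly which driver/CUDA stack the training venv sees. A minimal sketch, assuming it is run with the venv's python.exe (torch is already installed there for training):

```python
# Minimal sketch: print the CUDA stack visible to the kohya_ss venv.
# Run it with the venv's python.exe so it reflects the training environment.
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("no CUDA device visible")
```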

trick72 commented 1 month ago

When the training crashes, an event is logged in the Windows Application event log:

Faulting application name: python.exe, version: 3.10.9150.1013, time stamp: 0x638fa05d
Faulting module name: nvcuda64.dll, version: 32.0.15.6070, time stamp: 0x668eca2c
Exception code: 0xc0000005
Fault offset: 0x000000000002d13a
Faulting process ID: 0x2668
Faulting application start time: 0x01dae709114af2ab
Faulting application path: C:\Program Files\Python310\python.exe
Faulting module path: C:\Windows\system32\DriverStore\FileRepository\nv_dispi.inf_amd64_1196b342b24df5d1\nvcuda64.dll
Report ID: c6dcbbf6-5d67-42e1-824b-f6f3546b49d5
Faulting package full name:
Faulting package-relative application ID:
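The faulting module here is nvcuda64.dll with exception code 0xc0000005, i.e. an access violation inside the NVIDIA driver. To check whether the GPU also crashes under sustained load outside of kohya_ss, a minimal soak test can be run from the same venv; this is a sketch under the assumption that a CUDA-capable torch build is installed there:

```python
# Minimal sketch: a GPU soak test independent of kohya_ss.
# If this also dies with 0xc0000005 in nvcuda64.dll, the problem is likely in
# the driver/hardware layer (e.g. the new driver) rather than in the training code.
import torch

assert torch.cuda.is_available(), "no CUDA device visible"
x = torch.randn(4096, 4096, device="cuda")
for step in range(1000):
    x = torch.tanh(x @ x)          # keep the GPU busy with matmuls
    if step % 100 == 0:
        torch.cuda.synchronize()   # surface any asynchronous CUDA error here
        print(f"step {step} ok")
print("soak test finished without a crash")
```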