bmaltais / kohya_ss

Apache License 2.0
9.37k stars 1.21k forks

[Error] Error during training #1599

Closed: cat0318 closed this issue 7 months ago

cat0318 commented 11 months ago

I am a newbie in AI training. During my first LoRA training, I was unable to continue because of an error. My graphics card is a 4090. Thanks for your help. [Error]:

[Dataset 0] loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 4514.86it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (128, 192), count: 40
bucket 1: resolution (192, 192), count: 80
bucket 2: resolution (192, 256), count: 80
bucket 3: resolution (256, 192), count: 40
bucket 4: resolution (256, 320), count: 40
bucket 5: resolution (384, 512), count: 40
bucket 6: resolution (384, 576), count: 160
bucket 7: resolution (448, 512), count: 40
bucket 8: resolution (448, 576), count: 40
bucket 9: resolution (512, 448), count: 40
bucket 10: resolution (640, 384), count: 120
mean ar error (without repeats): 0.0708142223913111
prepare accelerator
loading model for process 0/1
load Diffusers pretrained models: runwayml/stable-diffusion-v1-5
text_encoder\model.safetensors not found
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 5/5 [00:01<00:00, 3.45it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
UNet2DConditionModel: 64, 8, 768, False, False
U-Net converted to original U-Net
Enable xformers for U-Net
[Dataset 0] caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 1806.07it/s]
caching latents...
0it [00:00, ?it/s]
prepare optimizer, data loader etc.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

CUDA SETUP: Loading binary E:\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
Traceback (most recent call last):
  File "E:\kohya_ss\train_db.py", line 488, in <module>
    train(args)
  File "E:\kohya_ss\train_db.py", line 212, in train
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "E:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1284, in prepare
    result = tuple(
  File "E:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1285, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "E:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1090, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "E:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1489, in prepare_model
    model = torch.compile(model, **self.state.dynamo_plugin.to_kwargs())
  File "E:\kohya_ss\venv\lib\site-packages\torch\__init__.py", line 1441, in compile
    return torch._dynamo.optimize(backend=backend, nopython=fullgraph, dynamic=dynamic, disable=disable)(model)
  File "E:\kohya_ss\venv\lib\site-packages\torch\_dynamo\eval_frame.py", line 413, in optimize
    check_if_dynamo_supported()
  File "E:\kohya_ss\venv\lib\site-packages\torch\_dynamo\eval_frame.py", line 375, in check_if_dynamo_supported
    raise RuntimeError("Windows not yet supported for torch.compile")
RuntimeError: Windows not yet supported for torch.compile
Traceback (most recent call last):
  File "C:\Users\hugua\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\hugua\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "E:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "E:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "E:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['E:\kohya_ss\venv\Scripts\python.exe', './train_db.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=E:/loratest/image', '--resolution=512,512', '--output_dir=E:/loratest/model', '--logging_dir=E:/loratest/log', '--save_model_as=safetensors', '--output_name=Clash', '--lr_scheduler_num_cycles=3', '--max_data_loader_n_workers=0', '--learning_rate=0.0001', '--lr_scheduler=cosine_with_restarts', '--lr_warmup_steps=216', '--train_batch_size=1', '--max_train_steps=2160', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--min_snr_gamma=10', '--xformers', '--bucket_no_upscale', '--multires_noise_iterations=8', '--multires_noise_discount=0.2']' returned non-zero exit status 1.
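
One way to read the traceback above: accelerate's prepare_model calls torch.compile, and the installed torch build refuses to do that on Windows. That suggests a TorchDynamo backend was selected somewhere in the accelerate setup, although the thread does not confirm this. The sketch below is a rough pre-flight check under that assumption. The function names are hypothetical, and reading the ACCELERATE_DYNAMO_BACKEND environment variable is an assumption about how the setting reaches the training process; the backend may instead come from the accelerate config file, which this check does not inspect.

```python
import os
import sys


def dynamo_requested(env=None):
    """Return True if a dynamo backend other than "no" is set in the environment.

    Assumption: the backend is passed via ACCELERATE_DYNAMO_BACKEND.
    A missing variable is treated as "no" (dynamo disabled).
    """
    env = os.environ if env is None else env
    backend = env.get("ACCELERATE_DYNAMO_BACKEND", "no")
    return backend.strip().lower() != "no"


def would_hit_windows_compile_error(platform=None, env=None):
    """Model the failure in the traceback: torch.compile requested on Windows.

    This only mirrors the behavior of the torch build shown in the log;
    it does not import torch or accelerate.
    """
    platform = sys.platform if platform is None else platform
    return platform.startswith("win") and dynamo_requested(env)


if __name__ == "__main__":
    if would_hit_windows_compile_error():
        print("A dynamo backend is requested on Windows; torch.compile may fail.")
    else:
        print("No dynamo backend requested via the environment on this platform.")
```

If the check reports a dynamo backend on Windows, rerunning `accelerate config` and leaving the dynamo option disabled is a plausible thing to try, though that is a guess based on the traceback, not a fix confirmed in this thread.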

ronbere commented 11 months ago

same

reviewtower commented 11 months ago

Can anyone fix this? I got the same issue as above. I thought it was still running, but no. It's frozen, like, forever, lol