Closed SkymillRobbie closed 1 year ago
Try using AdamW8bit if your card supports it, and maybe drop the train batch size to 1.
I set the optimizer to AdamW8bit and the batch size to 1, but I still have the same problem, sorry.
Have you tried with AdamW instead?
My mistake! I realized I was actually using the "Dreambooth" tab instead of the "Dreambooth LoRA" tab. It works great now!
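For anyone landing here with the same traceback: the crash happens while AdamW allocates `exp_avg_sq`, one of the two per-parameter buffers it keeps for every trainable weight. On the Dreambooth tab (`train_db.py`) the trainable weights are the whole model, whereas a LoRA only trains small adapter matrices. Here is a rough back-of-the-envelope sketch of the optimizer state alone. It assumes about 860 million U-Net parameters for an SD 1.x checkpoint, a purely illustrative 3 million for a LoRA, and roughly one byte per state value for an 8-bit optimizer. All three numbers are my assumptions, not values from the log, and `adam_state_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3

def adam_state_bytes(num_params: int, bytes_per_state: float = 4.0) -> float:
    """Bytes for Adam/AdamW's two per-parameter buffers (exp_avg and exp_avg_sq)."""
    return 2 * num_params * bytes_per_state

# Assumption: roughly 860M trainable U-Net parameters in an SD 1.x model.
UNET_PARAMS = 860_000_000
# Assumption: a few million trainable parameters for a small LoRA (illustrative only).
LORA_PARAMS = 3_000_000

full_fp32 = adam_state_bytes(UNET_PARAMS) / GIB                       # 32-bit states
full_8bit = adam_state_bytes(UNET_PARAMS, bytes_per_state=1.0) / GIB  # ~8-bit states
lora_fp32 = adam_state_bytes(LORA_PARAMS) / GIB

print(f"full fine-tune, AdamW (fp32 state):  {full_fp32:.2f} GiB")
print(f"full fine-tune, 8-bit state (approx): {full_8bit:.2f} GiB")
print(f"LoRA, AdamW (fp32 state):             {lora_fp32:.3f} GiB")
```

Weights, gradients and activations come on top of that, which is probably why an 11 GB card can run out on the Dreambooth tab even at batch size 1, while the LoRA tab fits comfortably.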
Currently trying to train a LoRA on an RTX 2080 Ti with 11 GB of VRAM. I'm only using 10 images, and they are all 512x512.
Whenever I run it though I get this error and it never completes.
Any idea how I can resolve this?
`prepare tokenizer
prepare images.
found directory M:\Games\Rax\Stable Diffusion\Kohya\Process Lora\zeraora lora\img\150_zeraora contains 10 image files
1500 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 2
resolution: (512, 512)
enable_bucket: True
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: True
[Subset 0 of Dataset 0]
image_dir: "M:\Games\Rax\Stable Diffusion\Kohya\Process Lora\zeraora lora\img\150_zeraora"
image_count: 10
num_repeats: 150
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: zeraora
caption_extension: .txt
[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1428.38it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 1500
mean ar error (without repeats): 0.0
prepare accelerator
Using accelerator 0.15.0 or above.
load StableDiffusion checkpoint
loading u-net:
loading vae:
loading text encoder:
Replace CrossAttention.forward to use xformers
[Dataset 0]
caching latents.
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00, 3.77it/s]
prepare optimizer, data loader etc.
use AdamW optimizer | {}
running training / 学習開始
num train images repeats / 学習画像の数×繰り返し回数: 1500
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 750
num epochs / epoch数: 1
batch size per device / バッチサイズ: 2
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 2
gradient ccumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 750
steps: 0%| | 0/750 [00:00<?, ?it/s]epoch 1/1
Traceback (most recent call last):
File "M:\Games\Rax\Stable Diffusion\Kohya\kohya_ss\train_db.py", line 429, in <module>
train(args)
File "M:\Games\Rax\Stable Diffusion\Kohya\kohya_ss\train_db.py", line 317, in train
optimizer.step()
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\optimizer.py", line 134, in step
self.scaler.step(self.optimizer, closure)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 338, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, args, kwargs)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 285, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\optim\lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\optim\optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\optim\adamw.py", line 148, in step
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
steps: 0%| | 0/750 [00:08<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
simple_launcher(args)
File "C:\Users\Redd\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Redd\AppData\Local\Programs\Python\Python310\python.exe', 'train_db.py', '--enable_bucket', '--pretrained_model_name_or_path=M:/Games/Rax/Stable Diffusion/stable-diffusion-webui/models/Stable-diffusion/AbyssOrangeMix2_hard.safetensors', '--train_data_dir=M:\Games\Rax\Stable Diffusion\Kohya\Process Lora\zeraora lora\img', '--resolution=512,512', '--output_dir=M:\Games\Rax\Stable Diffusion\Kohya\Process Lora\zeraora lora\model', '--logging_dir=M:\Games\Rax\Stable Diffusion\Kohya\Process Lora\zeraora lora\log', '--save_model_as=safetensors', '--output_name=zeraora', '--max_data_loader_n_workers=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=750', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
`
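A quick way to spot this kind of mix-up from the log is the script name in the failing command: here it is `train_db.py`. Below is a minimal sketch that pulls it out. It assumes (as far as I know for kohya_ss) that the Dreambooth tab launches `train_db.py` and the Dreambooth LoRA tab launches `train_network.py`. The helper names and the mapping are made up for illustration and are not part of any tool.

```python
from typing import Optional, Sequence

# Assumed mapping from script name to kohya_ss GUI tab (not taken from the log).
SCRIPT_TO_TAB = {
    "train_db.py": "Dreambooth (full fine-tune)",
    "train_network.py": "Dreambooth LoRA",
}

def training_script(cmd: Sequence[str]) -> Optional[str]:
    """Return the file name of the first .py argument after the interpreter, or None."""
    for arg in cmd[1:]:
        if arg.endswith(".py"):
            # Normalise Windows separators, then keep only the file name.
            return arg.replace("\\", "/").rsplit("/", 1)[-1]
    return None

def which_tab(cmd: Sequence[str]) -> str:
    """Guess which GUI tab produced the command, based on the assumed mapping."""
    return SCRIPT_TO_TAB.get(training_script(cmd), "unknown")

# Shortened version of the command from the error message above.
cmd = ["python.exe", "train_db.py", "--enable_bucket",
       "--train_batch_size=2", "--optimizer_type=AdamW"]
print(training_script(cmd), "->", which_tab(cmd))
```

If the script turns out to be `train_db.py` when you meant to train a LoRA, switching tabs is likely a better first step than tuning the optimizer or batch size.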