bmaltais / kohya_ss

Apache License 2.0

Trying to train with Intel Arc GPU - error when training #1900

Closed ep150de closed 4 months ago

ep150de commented 7 months ago

Sharing the traceback below. The run exits with return status 1.

14:25:39-679560 INFO Start training LoRA Standard ...
14:25:39-680164 INFO Checking for duplicate image filenames in training data directory...
14:25:39-680628 INFO Valid image folder names found in: /home/demo/pat/Training/img
14:25:39-681033 INFO Folder 40_Patman: 21 images found
14:25:39-681387 INFO Folder 40_Pat man: 840 steps
14:25:39-681703 INFO Total steps: 840
14:25:39-681998 INFO Train batch size: 2
14:25:39-682318 INFO Gradient accumulation steps: 1
14:25:39-682626 INFO Epoch: 4
14:25:39-682913 INFO Regulatization factor: 1
14:25:39-683206 INFO max_train_steps (840 / 2 / 1 * 4 * 1) = 1680
14:25:39-683597 INFO stop_text_encoder_training = 0
14:25:39-683914 INFO lr_warmup_steps = 0
14:25:39-684263 INFO Saving training config to /home/demo/pat/Training/model/pat-arc-run1_20240122-142539.json...
14:25:39-684759 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="/home/demo/pat/Training/img" --resolution="1024,1024" --output_dir="/home/demo/pat/Training/model" --logging_dir="/home/demo/pat/Training/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="pat-arc-run1" --lr_scheduler_num_cycles="4" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="1680" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_grad_norm="1" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --save_state --gradient_checkpointing --sdpa --bucket_no_upscale --noise_offset=0.0
/home/demo/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
/home/demo/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-01-22 14:25:43,439 - xformers - WARNING - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.1.0a0+cxx11.abi)
    Python 3.10.12 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
Traceback (most recent call last):
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/import_utils.py", line 710, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/__init__.py", line 1, in <module>
    from .autoencoder_asym_kl import AsymmetricAutoencoderKL
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_asym_kl.py", line 23, in <module>
    from .vae import DecoderOutput, DiagonalGaussianDistribution, Encoder, MaskConditionDecoder
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/vae.py", line 24, in <module>
    from ..attention_processor import SpatialNorm
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 32, in <module>
    import xformers.ops
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/xformers/ops/__init__.py", line 26, in <module>
    from .swiglu_op import (
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/xformers/ops/swiglu_op.py", line 102, in <module>
    class _SwiGLUFusedFunc(torch.autograd.Function):
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/xformers/ops/swiglu_op.py", line 106, in _SwiGLUFusedFunc
    @torch.cuda.amp.custom_fwd
AttributeError: module 'intel_extension_for_pytorch.xpu.amp' has no attribute 'custom_fwd'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/demo/kohya_ss/./sdxl_train_network.py", line 13, in <module>
    from library import sdxl_model_util, sdxl_train_util, train_util
  File "/home/demo/kohya_ss/library/sdxl_model_util.py", line 7, in <module>
    from diffusers import AutoencoderKL, EulerDiscreteScheduler, UNet2DConditionModel
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/import_utils.py", line 701, in __getattr__
    value = getattr(module, name)
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/import_utils.py", line 700, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/import_utils.py", line 712, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
module 'intel_extension_for_pytorch.xpu.amp' has no attribute 'custom_fwd'
Traceback (most recent call last):
  File "/home/demo/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/home/demo/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/demo/kohya_ss/venv/bin/python', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--train_data_dir=/home/demo/pat/Training/img', '--resolution=1024,1024', '--output_dir=/home/demo/pat/Training/model', '--logging_dir=/home/demo/pat/Training/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=0.0004', '--unet_lr=0.0004', '--network_dim=256', '--output_name=pat-arc-run1', '--lr_scheduler_num_cycles=4', '--no_half_vae', '--learning_rate=0.0004', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=1680', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Adafactor', '--optimizer_args', 'scale_parameter=False', 'relative_step=False', 'warmup_init=False', '--max_grad_norm=1', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--save_state', '--gradient_checkpointing', '--sdpa', '--bucket_no_upscale', '--noise_offset=0.0']' returned non-zero exit status 1.

Disty0 commented 7 months ago

The guide tells you to use SDPA, not xFormers:

https://www.technopat.net/sosyal/konu/installing-kohya-ss-with-intel-arc-gpus.2869152/

Run source venv/bin/activate, then pip uninstall xformers

ep150de commented 7 months ago

xformers is not installed, and I followed the guide and set cross attention to sdpa.
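The first traceback imports xformers from /home/demo/kohya_ss/venv/lib/python3.10/site-packages/xformers, which suggests the package was present in that virtual environment during that run. A minimal sketch for checking this, assuming it is run with the venv's own Python interpreter (the helper name installed_version is hypothetical, not part of kohya_ss):

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the traceback above.
    for name in ("xformers", "torch", "intel_extension_for_pytorch"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If xformers shows a version here, it is installed in the environment that runs the training script, even if the system Python does not have it.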

ep150de commented 7 months ago

In addition, when I launch it from the command line, it hangs and doesn't process any LoRA training steps:

16:10:15-259392 INFO Loading config...
16:10:15-297391 INFO SDXL model selected. Setting sdxl parameters
16:15:48-351783 INFO Start training LoRA Standard ...
16:15:48-353282 INFO Checking for duplicate image filenames in training data directory...
16:15:48-355283 INFO Valid image folder names found in: C:/Users/Demo/Desktop/pat/round2 training/img
16:15:48-356283 INFO Folder 40_Pat Gelsinger man: 21 images found
16:15:48-356784 INFO Folder 40_Pat Gelsinger man: 840 steps
16:15:48-357284 INFO Total steps: 840
16:15:48-357784 INFO Train batch size: 2
16:15:48-358283 INFO Gradient accumulation steps: 1
16:15:48-358784 INFO Epoch: 4
16:15:48-359284 INFO Regulatization factor: 1
16:15:48-359783 INFO max_train_steps (840 / 2 / 1 * 4 * 1) = 1680
16:15:48-360337 INFO stop_text_encoder_training = 0
16:15:48-360783 INFO lr_warmup_steps = 0
16:15:48-361784 INFO Saving training config to C:/Users/Demo/Desktop/pat/round2 training/model\patgv8_arc_20240123-161548.json...
16:15:48-362784 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="C:/Users/Demo/Desktop/pat/round2 training/img" --resolution="1024,1024" --output_dir="C:/Users/Demo/Desktop/pat/round2 training/model" --logging_dir="C:/Users/Demo/Desktop/pat/round2 training/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="patgv8_arc" --lr_scheduler_num_cycles="4" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="1680" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_grad_norm="1" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --save_state --gradient_checkpointing --sdpa --bucket_no_upscale --noise_offset=0.0
prepare tokenizers
Using DreamBooth method.
prepare images.
found directory C:\Users\Demo\Desktop\pat\round2 training\img\40_Pat Gelsinger man contains 21 image files
840 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 2
  resolution: (1024, 1024)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "C:\Users\Demo\Desktop\pat\round2 training\img\40_Pat Gelsinger man"
    image_count: 21
    num_repeats: 40
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator:
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: Pat Gelsinger man
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 3500.25it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (576, 320), count: 40
bucket 1: resolution (704, 384), count: 40
bucket 2: resolution (768, 512), count: 40
bucket 3: resolution (768, 576), count: 40
bucket 4: resolution (768, 1216), count: 40
bucket 5: resolution (832, 512), count: 40
bucket 6: resolution (896, 448), count: 40
bucket 7: resolution (896, 512), count: 40
bucket 8: resolution (960, 512), count: 80
bucket 9: resolution (1088, 832), count: 80
bucket 10: resolution (1152, 576), count: 40
bucket 11: resolution (1152, 640), count: 40
bucket 12: resolution (1152, 768), count: 40
bucket 13: resolution (1216, 640), count: 40
bucket 14: resolution (1216, 832), count: 160
bucket 15: resolution (1344, 768), count: 40
mean ar error (without repeats): 0.05671174881904848
preparing accelerator
loading model for process 0/1
load Diffusers pretrained models: stabilityai/stable-diffusion-xl-base-1.0, variant=None
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 6/6 [00:00<00:00, 7.26it/s]
U-Net converted to original U-Net
Enable SDPA for U-Net
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 1399.66it/s]
caching latents...
0it [00:00, ?it/s]
create LoRA network. base dim (rank): 256, alpha: 1.0
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder 1:
create LoRA for Text Encoder 2:
create LoRA for Text Encoder: 264 modules.
create LoRA for U-Net: 722 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.
use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False}
because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません
constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 840
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 420
  num epochs / epoch数: 4
  batch size per device / バッチサイズ: 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1680
steps: 0%| | 0/1680 [00:00<?, ?it/s]
epoch 1/4
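The step counts in the log above follow from simple arithmetic: 21 images with 40 repeats give 840 samples per epoch, and the logged formula 840 / 2 / 1 * 4 * 1 gives 1680. A minimal sketch that reproduces those numbers; the function name and the rounding with ceil are assumptions for illustration, not taken from the kohya_ss source:

```python
import math


def training_steps(images, repeats, batch_size, grad_accum, epochs, reg_factor=1):
    """Reproduce the step arithmetic shown in the training log."""
    samples_per_epoch = images * repeats
    # Rounding up is an assumption; it makes no difference for these values.
    batches_per_epoch = math.ceil(samples_per_epoch / batch_size)
    max_train_steps = math.ceil(
        samples_per_epoch / batch_size / grad_accum * epochs * reg_factor
    )
    return {
        "samples_per_epoch": samples_per_epoch,
        "batches_per_epoch": batches_per_epoch,
        "max_train_steps": max_train_steps,
    }


if __name__ == "__main__":
    # Values from the log: 21 images, 40 repeats, batch size 2, 4 epochs.
    print(training_steps(images=21, repeats=40, batch_size=2, grad_accum=1, epochs=4))
```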

Disty0 commented 7 months ago

Three possibilities: 1) The first step includes JIT trace time, so wait a bit. 2) You ran out of VRAM but the driver doesn't register it. 3) Windows. I haven't used Windows in years, so I can't tell whether anything works there.

ep150de commented 7 months ago

I left it running overnight to see if it was actually making progress:

Here's the trace time for the first step: 1/1680 [6:48:42<11437:05:57, 24522.67s/it, avr_loss=0.0762]
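To put that progress line in perspective: 24522.67 s/it is about 6 hours 48 minutes per step, and the 1679 remaining steps at that rate come to roughly 11,437 hours, which is consistent with the estimate in the progress bar. A small sketch of the arithmetic (the helper names are hypothetical):

```python
def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def remaining_seconds(sec_per_it, total_steps, done_steps):
    """Estimate the time left, assuming every step takes as long as the average so far."""
    return sec_per_it * (total_steps - done_steps)


if __name__ == "__main__":
    rate = 24522.67  # seconds per iteration, from the progress line
    left = remaining_seconds(rate, total_steps=1680, done_steps=1)
    print("time per step:", format_hms(rate))
    print("estimated hours remaining:", int(left // 3600))
    print("estimated days remaining:", round(left / 86400))
```

The first step can include one-off compilation or tracing cost, so extrapolating from a single step likely overstates the total, but a rate this slow still points to something other than normal GPU training.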

It seems to be running on the CPU only? When I try to run it in Ubuntu under WSL, I still get errors with the same settings/config and SDPA.
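One way to test the CPU-only suspicion is to ask PyTorch which accelerator it can see from inside the same venv. This is a hedged sketch, not part of kohya_ss: it assumes torch and, on older builds, intel_extension_for_pytorch expose torch.xpu.is_available(), and the function name detect_device is hypothetical.

```python
def detect_device():
    """Return "xpu", "cuda", "cpu", or "torch-missing" for the current environment."""
    try:
        import torch
    except ImportError:
        return "torch-missing"

    try:
        # On PyTorch builds without native XPU support, importing IPEX
        # is what registers the torch.xpu namespace.
        import intel_extension_for_pytorch  # noqa: F401
    except Exception:
        pass  # IPEX absent or failed to load; fall through to the checks below.

    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print("PyTorch sees device type:", detect_device())
```

If this prints cpu on a machine with an Arc card, the driver or IPEX installation is the first thing to look at, before any training settings.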

ep150de commented 7 months ago

This is the output when running in WSL/Ubuntu:

16:42:24-711244 INFO Loading config...
16:42:24-753031 INFO SDXL model selected. Setting sdxl parameters
16:43:14-769328 INFO Start training LoRA Standard ...
16:43:14-769907 INFO Checking for duplicate image filenames in training data directory...
16:43:14-771327 INFO Valid image folder names found in: /home/demo/pat/Training/img
16:43:14-772033 INFO Folder 40_Pat Gelsinger man: 21 images found
16:43:14-772506 INFO Folder 40_Pat Gelsinger man: 840 steps
16:43:14-772871 INFO Total steps: 840
16:43:14-773190 INFO Train batch size: 2
16:43:14-773505 INFO Gradient accumulation steps: 1
16:43:14-773827 INFO Epoch: 4
16:43:14-774158 INFO Regulatization factor: 1
16:43:14-774487 INFO max_train_steps (840 / 2 / 1 * 4 * 1) = 1680
16:43:14-774904 INFO stop_text_encoder_training = 0
16:43:14-775242 INFO lr_warmup_steps = 0
16:43:14-775636 INFO Saving training config to /home/demo/pat/Training/model/pat-arc-run1_20240124-164314.json...
16:43:14-776170 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="/home/demo/pat/Training/img" --resolution="1024,1024" --output_dir="/home/demo/pat/Training/model" --logging_dir="/home/demo/pat/Training/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="pat-arc-run1" --lr_scheduler_num_cycles="4" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="1680" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="AdamW" --max_grad_norm="1" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --save_state --gradient_checkpointing --sdpa --bucket_no_upscale --noise_offset=0.0
/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-01-24 16:43:18.456803: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-01-24 16:43:18.486481: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-01-24 16:43:18.624320: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-24 16:43:18.624357: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-24 16:43:18.625230: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-24 16:43:18.699991: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-01-24 16:43:18.700681: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-24 16:43:19.383030: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-24 16:43:19.978691: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow AVX2 CPU backend is loaded.
2024-01-24 16:43:20.727599: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow GPU backend is loaded.
2024-01-24 16:43:20.846529: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2024-01-24 16:43:20.846570: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-01-24 16:43:20.848340: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: UNKNOWN ERROR (100)
prepare tokenizers
Using DreamBooth method.
prepare images.
found directory /home/demo/pat/Training/img/40_Pat Gelsinger man contains 21 image files
840 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 2
  resolution: (1024, 1024)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "/home/demo/pat/Training/img/40_Pat Gelsinger man"
    image_count: 21
    num_repeats: 40
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator:
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: Pat Gelsinger man
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|█████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 2565.62it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (576, 320), count: 40
bucket 1: resolution (704, 384), count: 40
bucket 2: resolution (768, 512), count: 40
bucket 3: resolution (768, 576), count: 40
bucket 4: resolution (768, 1216), count: 40
bucket 5: resolution (832, 512), count: 40
bucket 6: resolution (896, 448), count: 40
bucket 7: resolution (896, 512), count: 40
bucket 8: resolution (960, 512), count: 80
bucket 9: resolution (1088, 832), count: 80
bucket 10: resolution (1152, 576), count: 40
bucket 11: resolution (1152, 640), count: 40
bucket 12: resolution (1152, 768), count: 40
bucket 13: resolution (1216, 640), count: 40
bucket 14: resolution (1216, 832), count: 160
bucket 15: resolution (1344, 768), count: 40
mean ar error (without repeats): 0.05671174881904848
preparing accelerator
loading model for process 0/1
load Diffusers pretrained models: stabilityai/stable-diffusion-xl-base-1.0, variant=None
Loading pipeline components...: 100%|█████████████████████████████████████████| 6/6 [00:01<00:00, 4.98it/s]
U-Net converted to original U-Net
Enable SDPA for U-Net
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|██████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 780.36it/s]
caching latents...
0it [00:00, ?it/s]
create LoRA network. base dim (rank): 256, alpha: 1.0
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder 1:
create LoRA for Text Encoder 2:
create LoRA for Text Encoder: 264 modules.
create LoRA for U-Net: 722 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.
use AdamW optimizer | {}
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 840
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 420
  num epochs / epoch数: 4
  batch size per device / バッチサイズ: 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1680
steps: 0%| | 0/1680 [00:00<?, ?it/s]
epoch 1/4
/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
  File "/home/demo/kohya/kohya_ss/./sdxl_train_network.py", line 189, in <module>
    trainer.train(args)
  File "/home/demo/kohya/kohya_ss/train_network.py", line 783, in train
    noise_pred = self.call_unet(
  File "/home/demo/kohya/kohya_ss/./sdxl_train_network.py", line 169, in call_unet
    noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/demo/kohya/kohya_ss/library/sdxl_original_unet.py", line 1105, in forward
    h = torch.cat([h, hs.pop()], dim=1)
  File "/home/demo/kohya/kohya_ss/library/ipex/hijacks.py", line 122, in torch_cat
    return original_torch_cat(tensor, *args, **kwargs)
RuntimeError: could not create a primitive descriptor for a concat primitive
steps: 0%| | 0/1680 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/demo/kohya/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/home/demo/kohya/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/demo/kohya/kohya_ss/venv/bin/python', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--train_data_dir=/home/demo/pat/Training/img', '--resolution=1024,1024', '--output_dir=/home/demo/pat/Training/model', '--logging_dir=/home/demo/pat/Training/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=0.0004', '--unet_lr=0.0004', '--network_dim=256', '--output_name=pat-arc-run1', '--lr_scheduler_num_cycles=4', '--no_half_vae', '--learning_rate=0.0004', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=1680', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=AdamW', '--max_grad_norm=1', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--save_state', '--gradient_checkpointing', '--sdpa', '--bucket_no_upscale', '--noise_offset=0.0']' returned non-zero exit status 1.

Disty0 commented 7 months ago

You've run out of VRAM. Even Full BF16 won't save you with these settings. Turn on Full BF16 and then turn down your settings so they fit in 16 GB of VRAM.

Full BF16 + Network Dim 128 with your settings should fit in 16 GB.
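As a rough illustration of why lowering the network dim helps: for a linear layer, a LoRA adapter of rank r adds a down-projection of shape (r, in_features) and an up-projection of shape (out_features, r), so its parameter count grows linearly with r. The sketch below uses a hypothetical 1280x1280 layer, not a measured SDXL shape, and it covers only the adapter weights. Gradients and optimizer state scale the same way, while the base model and activations are unaffected by the rank.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter on a linear layer: r * (in + out)."""
    return rank * (in_features + out_features)


def size_in_mib(num_params, bytes_per_param=2):
    """Storage for the weights alone; 2 bytes per parameter corresponds to bf16."""
    return num_params * bytes_per_param / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    for rank in (256, 128):
        n = lora_params(1280, 1280, rank)
        print(f"rank {rank}: {n} params, {size_in_mib(n):.2f} MiB in bf16")
```

Halving the dim from 256 to 128 halves the adapter parameters per module, which is why it is a quick lever to pull when a run does not fit.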