Similar issue trying to create textual inversion with 3070 with both fp16 and bf16

prepare tokenizers prepare accelerator loading model for process 0/1 load StableDiffusion checkpoint: C:/Users/grome/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0_0.9vae.safetensors building U-Net loading U-Net from checkpoint U-Net: building text encoders loading text encoders from checkpoint text encoder 1: text encoder 2: building VAE loading VAE from checkpoint VAE: token length for init words is not same to num_vectors_per_token, init words is repeated or truncated / 初期化単語のトークン長がnum_vectors_per_tokenと合わないため、繰り返しまたは切り捨てが発生します: tokenizer 1, length 3 token length for init words is not same to num_vectors_per_token, init words is repeated or truncated / 初期化単語のトークン長がnum_vectors_per_tokenと合わないため、繰り返しまたは切り捨てが発生します: tokenizer 2, length 3 tokens are added for tokenizer 1: [49408] tokens are added for tokenizer 2: [49408] create embeddings for 1 tokens, for muscleman Use DreamBooth method. prepare images. found directory C:\kohya_ss\test000001\img\20_chris hemsworth man contains 3 image files No caption file found for 3 images. Training will continue without captions for these images. If class token exists, it will be used. / 3枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。 C:\kohya_ss\test000001\img\20_chris hemsworth man\2023-09-28 15_23_09.649+0100.jpg C:\kohya_ss\test000001\img\20_chris hemsworth man\20230802_162140.jpg C:\kohya_ss\test000001\img\20_chris hemsworth man\20230819_180411.jpg found directory C:\kohya_ss\test000001\reg\1_man contains 6 image files No caption file found for 6 images. Training will continue without captions for these images. If class token exists, it will be used. / 6枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。 C:\kohya_ss\test000001\reg\1_man\0000.png C:\kohya_ss\test000001\reg\1_man\0012.png C:\kohya_ss\test000001\reg\1_man\0026.png C:\kohya_ss\test000001\reg\1_man\0043.png C:\kohya_ss\test000001\reg\1_man\0059.png C:\kohya_ss\test000001\reg\1_man\0069.png... and 1 more 60 train images with repeating. 6 reg images. [Dataset 0] batch_size: 1 resolution: (1024, 1024) enable_bucket: True min_bucket_reso: 256 max_bucket_reso: 2048 bucket_reso_steps: 64 bucket_no_upscale: True

[Subset 0 of Dataset 0] image_dir: "C:\kohya_ss\test000001\img\20_chris hemsworth man" image_count: 3 num_repeats: 20 shuffle_caption: False keep_tokens: 0 caption_dropout_rate: 0.0 caption_dropout_every_n_epoches: 0 caption_tag_dropout_rate: 0.0 caption_prefix: None caption_suffix: None color_aug: False flip_aug: False face_crop_aug_range: None random_crop: False token_warmup_min: 1, token_warmup_step: 0, is_reg: False class_tokens: chris hemsworth man caption_extension: .caption

[Subset 1 of Dataset 0] image_dir: "C:\kohya_ss\test000001\reg\1_man" image_count: 6 num_repeats: 1 shuffle_caption: False keep_tokens: 0 caption_dropout_rate: 0.0 caption_dropout_every_n_epoches: 0 caption_tag_dropout_rate: 0.0 caption_prefix: None caption_suffix: None color_aug: False flip_aug: False face_crop_aug_range: None random_crop: False token_warmup_min: 1, token_warmup_step: 0, is_reg: True class_tokens: man caption_extension: .caption

[Dataset 0] loading image sizes. 100%|██████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 1800.04it/s] make buckets min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます number of images (including repeats) / 各bucketの画像枚数（繰り返し回数を含む） bucket 0: resolution (704, 1280), count: 20 bucket 1: resolution (768, 1152), count: 20 bucket 2: resolution (1024, 1024), count: 60 bucket 3: resolution (1088, 832), count: 20 mean ar error (without repeats): 0.008814220608011106 Enable xformers for U-Net A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton' [Dataset 0] caching latents. checking cache validity... 100%|██████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 1285.72it/s] caching latents... 0it [00:00, ?it/s] prepare optimizer, data loader etc. use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False} because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ C:\kohya_ss\sdxl_train_textual_inversion.py:133 in │ │ │ │ 130 │ args = train_util.read_config_from_file(args, parser) │ │ 131 │ │ │ 132 │ trainer = SdxlTextualInversionTrainer() │ │ ❱ 133 │ trainer.train(args) │ │ 134 │ │ │ │ C:\kohya_ss\train_textual_inversion.py:443 in train │ │ │ │ 440 │ │ │ # text_encoder.text_model.embeddings.token_embedding.requiresgrad(True) │ │ 441 │ │ │ │ 442 │ │ unet.requiresgrad(False) │ │ ❱ 443 │ │ unet.to(accelerator.device, dtype=weight_dtype) │ │ 444 │ │ if args.gradient_checkpointing: # according to TI example in Diffusers, train i │ │ 445 │ │ │ # TODO U-Netをオリジナルに置き換えたのでいらないはずなので、後で確認して消す │ │ 446 │ │ │ unet.train() │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:1145 in to │ │ │ │ 1142 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_format) │ │ 1143 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.is_complex() else No │ │ 1144 │ │ │ │ ❱ 1145 │ │ return self._apply(convert) │ │ 1146 │ │ │ 1147 │ def register_full_backward_pre_hook( │ │ 1148 │ │ self, │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:797 in _apply │ │ │ │ 794 │ │ │ 795 │ def _apply(self, fn): │ │ 796 │ │ for module in self.children(): │ │ ❱ 797 │ │ │ module._apply(fn) │ │ 798 │ │ │ │ 799 │ │ def compute_should_use_set_data(tensor, tensor_applied): │ │ 800 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:820 in _apply │ │ │ │ 817 │ │ │ # track autograd history of param_applied, so we have to use │ │ 818 │ │ │ # with torch.no_grad(): │ │ 819 │ │ │ with torch.no_grad(): │ │ ❱ 820 │ │ │ │ param_applied = fn(param) │ │ 821 │ │ │ should_use_set_data = compute_should_use_set_data(param, param_applied) │ │ 822 │ │ │ if should_use_set_data: │ │ 823 │ │ │ │ param.data = param_applied │ │ │ │ C:\Python310\lib\site-packages\torch\nn\modules\module.py:1143 in convert │ │ │ │ 1140 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │ │ 1141 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or t.is_complex() els │ │ 1142 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_format) │ │ ❱ 1143 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.is_complex() else No │ │ 1144 │ │ │ │ 1145 │ │ return self._apply(convert) │ │ 1146 │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 8.00 GiB total capacity; 7.04 GiB already allocated; 0 bytes free; 7.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ C:\Python310\lib\runpy.py:196 in _run_module_as_main │ │ │ │ 193 │ main_globals = sys.modules["main"].dict │ │ 194 │ if alter_argv: │ │ 195 │ │ sys.argv[0] = mod_spec.origin │ │ ❱ 196 │ return _run_code(code, main_globals, None, │ │ 197 │ │ │ │ │ "main", mod_spec) │ │ 198 │ │ 199 def run_module(mod_name, init_globals=None, │ │ │ │ C:\Python310\lib\runpy.py:86 in _run_code │ │ │ │ 83 │ │ │ │ │ loader = loader, │ │ 84 │ │ │ │ │ package = pkg_name, │ │ 85 │ │ │ │ │ spec = mod_spec) │ │ ❱ 86 │ exec(code, run_globals) │ │ 87 │ return run_globals │ │ 88 │ │ 89 def _run_module_code(code, init_globals=None, │ │ │ │ in :7 │ │ │ │ 4 from accelerate.commands.accelerate_cli import main │ │ 5 if name == 'main': │ │ 6 │ sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0]) │ │ ❱ 7 │ sys.exit(main()) │ │ 8 │ │ │ │ C:\Python310\lib\site-packages\accelerate\commands\accelerate_cli.py:45 in main │ │ │ │ 42 │ │ exit(1) │ │ 43 │ │ │ 44 │ # Run │ │ ❱ 45 │ args.func(args) │ │ 46 │ │ 47 │ │ 48 if name == "main": │ │ │ │ C:\Python310\lib\site-packages\accelerate\commands\launch.py:918 in launch_command │ │ │ │ 915 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │ │ 916 │ │ sagemaker_launcher(defaults, args) │ │ 917 │ else: │ │ ❱ 918 │ │ simple_launcher(args) │ │ 919 │ │ 920 │ │ 921 def main(): │ │ │ │ C:\Python310\lib\site-packages\accelerate\commands\launch.py:580 in simple_launcher │ │ │ │ 577 │ process.wait() │ │ 578 │ if process.returncode != 0: │ │ 579 │ │ if not args.quiet: │ │ ❱ 580 │ │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │ │ 581 │ │ else: │ │ 582 │ │ │ sys.exit(1) │ │ 583 │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ CalledProcessError: Command '['C:\Python310\python.exe', './sdxl_train_textual_inversion.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=C:/Users/grome/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0_0.9vae.saf etensors', '--train_data_dir=C:/kohya_ss/test000001/img', '--reg_data_dir=C:/kohya_ss/test000001/reg', '--resolution=1024,1024', '--output_dir=C:/kohya_ss/test000001', '--save_model_as=safetensors', '--output_name=furidrums', '--lr_scheduler_num_cycles=1', '--max_data_loader_n_workers=0', '--no_half_vae', '--learning_rate=0.0003', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=120', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Adafactor', '--optimizer_args', 'scale_parameter=False', 'relative_step=False', 'warmup_init=False', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0', '--token_string=muscleman', '--init_word=*furidrums', '--num_vectors_per_token=1']' returned non-zero exit status 1.

bitsandbytes-foundation / bitsandbytes

Similar issue trying to create textual inversion with 3070 with both fp16 and bf16 #798