AUTOMATIC1111 / stable-diffusion-webui


[Bug]: Torch.cuda error during Textual Inversion training #4137

Closed: mykeehu closed this issue 1 year ago

mykeehu commented 1 year ago

Is there an existing issue for this?

What happened?

Something is not working 100% for me either: when I copy an earlier embedding step back and then resume training, I get interesting torch errors. After that, neither the preview nor the progress updates in the UI. I stopped the training and resumed it, and it went fine for a while, but the error sometimes came back and sometimes didn't. I don't use the --xformers --medvram --precision full --no-half options. My card is an RTX 3060 12GB.

Traceback (most recent call last): | 260/20000 [06:55<1:16:29, 4.30it/s]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\routes.py", line 275, in run_predict
    output = await app.blocks.process_api(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 787, in process_api
    result = await self.call_function(fn_index, inputs, iterator)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 694, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 526, in <lambda>
    fn=lambda: check_progress_call(id_part),
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 284, in check_progress_call
    shared.state.current_image = modules.sd_samplers.samples_to_image_grid(shared.state.current_latent)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 100, in samples_to_image_grid
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 100, in <listcomp>
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 88, in single_sample_to_image
    x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\processing.py", line 367, in decode_first_stage
    x = model.decode_first_stage(x)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 763, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\autoencoder.py", line 332, in decode
    dec = self.decoder(z)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 553, in forward
    h = self.up[i_level].block[i_block](h, temb)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 125, in forward
    h = self.conv1(h)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.HalfTensor) should be the same

Steps to reproduce the problem

  1. Create a Textual Inversion embedding with 4 vectors and 18 images; initialization text: caricature; image size: 512x512; learning rate: 5e-04:200, 5e-05:500, 5e-06:800, 5e-07:1000 (this rate:step syntax is illustrated in the sketch after this list); max steps: 1000; preview and embedding saved every 50 steps
  2. Train embedding
  3. If you didn't get an error, try training again under a different name
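
For reference, the stepped learning-rate syntax in step 1 means "use this rate until that step"; the training log later in the thread ("Training at rate of 0.005 until step 200", and so on) confirms that reading. The helper below is only a sketch of those semantics, not the webui's own parser:

    # Illustrative only: mimics the "rate:step, rate:step, ..." schedule semantics.
    def parse_lr_schedule(text):
        pairs = []
        for chunk in text.split(","):
            rate, _, until = chunk.strip().partition(":")
            pairs.append((float(rate), int(until) if until else None))
        return pairs

    def lr_for_step(pairs, step):
        # Use each rate until its step boundary has been passed, then move on.
        for rate, until in pairs:
            if until is None or step <= until:
                return rate
        return pairs[-1][0]  # past the last boundary: keep the final rate

    # lr_for_step(parse_lr_schedule("5e-04:200, 5e-05:500"), 150) -> 0.0005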

What should have happened?

In previous versions, there was no error during training.

Commit where the problem happens

198a1ffcfc963a3d74674fad560e87dbebf7949f

What platforms do you use to access the UI?

Windows

What browsers do you use to access the UI?

Google Chrome

Command Line Arguments

--ui-config-file 1my/ui-config-my.json --ui-settings-file 1my/config-my.json --autolaunch --gradio-img2img-tool color-sketch --vae-path "models\Stable-diffusion\newVAE.vae.pt"

Additional information, context and logs

The content of my training template file:

a caricature art by [name]
a caricature, art by [name]
a caricature by [name]
art by [name]
mykeehu commented 1 year ago

In this first training session I got the above message at 600 steps with 1 vector. I stopped at 800 steps, deleted the embedding file and log, then recreated it, this time training with the learning rate 5e-03:200, 5e-04:400, 5e-05:800, 5e-06:1000. At 150 steps I got this:

Traceback (most recent call last): | 380/20000 [10:32<1:19:31, 4.11it/s]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\routes.py", line 283, in run_predict
    output = await app.blocks.process_api(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 936, in process_api
    result = await self.call_function(fn_index, inputs, iterator)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 777, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 526, in <lambda>
    fn=lambda: check_progress_call(id_part),
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 284, in check_progress_call
    shared.state.current_image = modules.sd_samplers.samples_to_image_grid(shared.state.current_latent)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 101, in samples_to_image_grid
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 101, in <listcomp>
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 89, in single_sample_to_image
    x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\processing.py", line 371, in decode_first_stage
    x = model.decode_first_stage(x)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 763, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\autoencoder.py", line 332, in decode
    dec = self.decoder(z)
[Epoch 0: 150/1800]loss: 0.0754993: 15%|▍ | 150/1000 [01:37<31:12, 2.20s/it]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 553, in forward
    h = self.up[i_level].block[i_block](h, temb)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 123, in forward
    h = self.norm1(h)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\normalization.py", line 272, in forward
    return F.group_norm(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\functional.py", line 2516, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper__native_group_norm)

I don't know where that 380/20000 came from, because I don't have 20000 set anywhere, so that's weird too. After this error the training became unusable if I continued; I had to start again.

Update 1: I restarted SD and now at 300 steps the error from the original post came up. It must be some strange error that keeps a training run from going through. Fortunately, despite the torch error, the embedding doesn't break in this case; only the interface stops updating. With the tensor/devices error, however, the further results are no longer usable. It came in with yesterday's code, because things went bad after an update.

Update 2: the bug is still present in 172c4bc09f0866e7dd114068ebe0f9abfe79ef33; this time the devices problem already occurred at the 50th step.

mykeehu commented 1 year ago

I just got a new error after the first 50 steps (the auto-save point) in the latest version, cd5eafaf03a25d2b0e35154666947b9489078af9:

[Epoch 0: 50/1000]loss: 0.1734919: 5%|▏ | 49/1000 [00:36<11:44, 1.35it/s]
Applying cross attention optimization (Doggettx).
Error completing request
Arguments: ('rejtocartoon', '5e-03:200, 5e-04:500, 5e-05:800, 5e-06:1000', 1, 'H:\Stable-Diffusion-Automatic\textual inversion\rejto\dest', 'H:\Stable-Diffusion-Automatic\textual inversion\rejto\log', 512, 512, 1000, 50, 50, 'H:\Stable-Diffusion-Automatic\textual inversion\rejto\cartoonstyle.txt', False, False, '', '', 20, 0, 7, -1.0, 640, 640) {}
Traceback (most recent call last):
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 185, in f
    res = list(func(*args, **kwargs))
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\webui.py", line 55, in f
    res = func(*args, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\textual_inversion\ui.py", line 33, in train_embedding
    embedding, filename = modules.textual_inversion.textual_inversion.train_embedding(*args)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 365, in train_embedding
    shared.sd_model.first_stage_model.to(devices.cpu)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\pytorch_lightning\core\mixins\device_dtype_mixin.py", line 113, in to
    return super().to(*args, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 927, in to
    return self._apply(convert)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
    param_applied = fn(param)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Traceback (most recent call last):
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\routes.py", line 283, in run_predict
    output = await app.blocks.process_api(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 936, in process_api
    result = await self.call_function(fn_index, inputs, iterator)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 777, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 518, in <lambda>
    fn=lambda: check_progress_call(id_part),
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 280, in check_progress_call
    shared.state.set_current_image()
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\shared.py", line 194, in set_current_image
    self.current_image = sd_samplers.samples_to_image_grid(self.current_latent)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 101, in samples_to_image_grid
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 101, in <listcomp>
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 89, in single_sample_to_image
    x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\processing.py", line 371, in decode_first_stage
    x = model.decode_first_stage(x)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 763, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\autoencoder.py", line 332, in decode
    dec = self.decoder(z)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 553, in forward
    h = self.up[i_level].block[i_block](h, temb)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 133, in forward
    h = self.conv2(h)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 512, 128, 128], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(512, 512, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    memory_format = Contiguous
    data_type = CUDNN_DATA_HALF
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0000022292CEB960
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 512, 128, 128,
    strideA = 8388608, 16384, 128, 1,
output: TensorDescriptor 0000022292CEBC00
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 512, 128, 128,
    strideA = 8388608, 16384, 128, 1,
weight: FilterDescriptor 000002235A8B4220
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 512, 512, 3, 3,
Pointer addresses:
    input: 0000000C4D000000
    output: 0000000C36E08000
    weight: 00000223EB5A4080
Forward algorithm: 1

It's getting more and more exciting... I am currently training with model-v1-5-pruned-ema-only.ckpt and SD VAE set to None.

mykeehu commented 1 year ago

Okay, you need to look at how the VAE is loaded and selected! If SD VAE is set to "auto", the training works without errors, but if it is set to a specific model or to "None", the training throws errors!

Update 1: at the 900th and 1300th steps I got the two-devices error again... :(

Update 2: I started a new training run this morning, and whenever it generates a preview image, the two-devices error reliably appears; this time it came up at step 200. The training went to 1000 steps, then I continued to 10000. The UI recovered after I pressed the Train Embedding button again, but at the first save (step 50) it threw the error once more and image generation in the interface died, while the training just continued in the background. Could it be a Gradio error? It's as if it loses the device while generating the preview image.

mykeehu commented 1 year ago

Okay, I found the source of the bug: the optimization in build f071a1d25aa8b35bb6406a133df1d03ae5ea8d01 does not properly hand control back to the video card and confuses the training. The "if xformers is not loaded..." handling is missing; it has not been tested without xformers. I think the problem is at lines 335 and 412, which lack an "if unload:" before them and simply move the model back to the device:

shared.sd_model.first_stage_model.to(devices.device)

while at lines 277 and 365 the move to the CPU is guarded by the condition:

if unload:
        shared.sd_model.first_stage_model.to(devices.cpu)

@AUTOMATIC1111 or @dfaker, please review this code, because it causes the errors above; please revert or modify it, because I can't update until it is fixed! Try it without xformers! In addition, it also breaks the new SD VAE list (when I'm training I can only use the "auto" option).

triton2030 commented 1 year ago

Oh yes, I'm having the same problem. Textual Inversion is still king here, so it would be cool to fix this.

mykeehu commented 1 year ago

I'll wait a few days for the problem to be fixed, but if it isn't, I'll revert the code to restore it to its original state. I'm not a programmer, so I don't have experience with other kinds of fixes, but I'd love to keep the new features. Maybe @MarkovInequality could fix the problem.

MarkovInequality commented 1 year ago

I just tested the latest build without xformers, using the Doggettx and InvokeAI optimizations, and I can't seem to reproduce the problem. I also tested training with SD VAE selected as None, auto, and a specific VAE, and unfortunately I can't replicate the error either. I don't believe this has anything to do with whether you have xformers enabled; it is more likely related to moving the VAE from the GPU to the CPU. You can temporarily disable that by unticking the "Move VAE and CLIP to RAM when training if possible" option in the settings.

lines 335 and 412, it lacks the "if unload:" before it, so it just passes control to the device.

I don't think this is the problem either: the code unconditionally moves the VAE back to the GPU just before we need it to generate the preview image. If the VAE is already on the GPU, this should be a no-op.
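
A quick standalone check of that no-op claim (a sketch, not webui code; assumes a CUDA device is available):

    import torch

    m = torch.nn.Linear(4, 4).cuda()
    before = m.weight.data_ptr()
    m.to(torch.device("cuda"))            # module is already on cuda:0
    assert m.weight.data_ptr() == before  # no copy happened, so the move really is a no-op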

Can you send me your launch parameters and a screenshot of the "Training" section of the settings page?

mykeehu commented 1 year ago

I just updated to build f2b69709eaff88fc3a2bd49585556ec0883bf5ea. I use these settings and am running another training session. I always turn on the "Move VAE and CLIP to RAM when training if possible" option, because training is much more efficient that way. [screenshot of the Training settings]

I'll start a training run now and see whether it breaks with these parameters:

[screenshot of the training parameters]

mykeehu commented 1 year ago

Ok, now I got the error at step 350:

[Epoch 0: 351/600]loss: 0.0620605: 35%|█▍ | 351/1000 [03:50<18:38, 1.72s/it]
Traceback (most recent call last):██████████████| 20/20 [00:04<00:00, 4.32it/s]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\routes.py", line 283, in run_predict
    output = await app.blocks.process_api(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 936, in process_api
    result = await self.call_function(fn_index, inputs, iterator)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 777, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 518, in <lambda>
    fn=lambda: check_progress_call(id_part),
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\ui.py", line 280, in check_progress_call
    shared.state.set_current_image()
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\shared.py", line 194, in set_current_image
    self.current_image = sd_samplers.samples_to_image_grid(self.current_latent)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 101, in samples_to_image_grid
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 101, in <listcomp>
    return images.image_grid([single_sample_to_image(sample) for sample in samples])
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\sd_samplers.py", line 89, in single_sample_to_image
    x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\modules\processing.py", line 363, in decode_first_stage
    x = model.decode_first_stage(x)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 763, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\autoencoder.py", line 332, in decode
    dec = self.decoder(z)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 553, in forward
    h = self.up[i_level].block[i_block](h, temb)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\model.py", line 125, in forward
    h = self.conv1(h)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "H:\Stable-Diffusion-Automatic\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I'm about to try the SD VAE "None" option again.

Update 1: I got the same error, now at the 700th step, with SD VAE set to "None". I'll revert the code to the state before your change and run another test.

MarkovInequality commented 1 year ago

I think I found the problem: moving the VAE to the CPU doesn't play nicely with "Show image creation progress every N sampling steps". Set it to 0 to disable it.

Can you try setting that option to 0 and see if you still get the problem? I think the conflict causes a race condition that apparently just happens to work out on my computer.

In the meantime, can you also try it on hypernetworks to see whether you get the same problem there? I think the same issue might apply to hypernetworks as well.
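
For what it's worth, the suspected interaction can be sketched outside the webui: one thread keeps decoding a CUDA latent (standing in for the live-preview callback) while another offloads the decoder to the CPU (standing in for the training loop's VRAM-saving move). This toy is timing-dependent, so it may or may not fail on any given run, and it assumes a CUDA device:

    import threading
    import torch

    decoder = torch.nn.Conv2d(4, 3, kernel_size=3, padding=1).cuda()  # stand-in for the VAE decoder
    latent = torch.randn(8, 4, 64, 64, device="cuda")                 # stand-in for the current latent

    def preview_worker():
        try:
            for _ in range(100):
                decoder(latent)  # needs the weights on cuda:0
        except RuntimeError as err:
            print("preview thread failed:", err)  # e.g. input/weight device mismatch

    t = threading.Thread(target=preview_worker)
    t.start()
    decoder.to("cpu")  # the "training loop" unloading the decoder to save VRAM
    t.join()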

mykeehu commented 1 year ago

After I reverted your changes, the training ran flawlessly:

Training at rate of 0.005 until step 200
Preparing dataset...
100%|████████████████████████████████████████████| 6/6 [00:02<00:00, 2.12it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.13it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.14it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.11it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
Training at rate of 0.0005 until step 500
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.12it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.07it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
Training at rate of 5e-05 until step 800
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.11it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.11it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.12it/s]
100%|██████████████████████████████████████████| 20/20 [00:05<00:00, 4.00it/s]
Training at rate of 5e-06 until step 1000
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.11it/s]
100%|██████████████████████████████████████████| 20/20 [00:04<00:00, 4.10it/s]
[Epoch 1: 400/600]loss: 0.0674431: 100%|███| 1000/1000 [10:40<00:00, 1.56it/s]
Applying cross attention optimization (Doggettx).20/20 [00:04<00:00, 4.22it/s]

I'll undo my changes now and set the preview generation from 5 to 0, then I'll report back. I don't train hypernetworks.

MarkovInequality commented 1 year ago

Can you please try training a hypernetwork with "Show image creation progress every N sampling steps" set to 0 and my optimization in place, with the "Move VAE and CLIP to RAM when training if possible" setting both on and off?

I can't seem to trigger the race condition on my computer, so I'll have to rely on you to test before I make a PR to fix this issue.

mykeehu commented 1 year ago

It looks like you found the source of the error: after I set "Show image creation progress every N sampling steps" to zero, the training ran without errors. I had set it to 5 so that I could watch the training continuously and abort it if there was a problem. I'll check it with hypernetworks in a bit.


Hypernetwork Results (1000 steps): With "Move VAE and CLIP to RAM if possible" option enabled:

With "Move VAE and CLIP to RAM if possible" option disabled:

The problem seems to affect only Textual Inversion training. Could it be that something was fixed in HN that wasn't fixed in TI? Just guessing.

MarkovInequality commented 1 year ago

I just checked the code for HNs: if it's happening for TI, it should theoretically happen for HN as well. After running hypernetwork training to double-check, I can see that the problem-causing function for TI is indeed also called during HN training. I'll add the fix to HNs just in case. Thanks for testing.

mykeehu commented 1 year ago

Maybe I am misunderstanding something, but the fault is in TI, not HN; everything is fine there.

TI and "Show image creation progress every N sampling steps" to 5 goes to error.

MarkovInequality commented 1 year ago

The reason is that both TI and HN attempt to move the VAE to the CPU to save VRAM. However, in a parallel thread, the progress bar calls sd_samplers.sample_to_image(self.current_latent).

This is a problem because that function requires the VAE to be on the GPU. On top of that, the call can also occur while we're moving the VAE back to the GPU, which I suspect is what's generating your other errors.
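
One possible shape of a guard (purely illustrative, not necessarily what the eventual fix does) is to skip the preview update whenever the first-stage model is not on the latent's device, instead of decoding regardless:

    def maybe_render_preview(decode_fn, latent, first_stage_model):
        # Hypothetical helper: only decode the preview if the VAE currently
        # lives on the same device as the latent; otherwise keep the old preview.
        vae_device = next(first_stage_model.parameters()).device
        if vae_device != latent.device:
            return None  # VAE is offloaded (or mid-move); skip this update
        return decode_fn(latent)

A lock shared by the offload/onload code and the preview callback would close the remaining window; the device check above only narrows it.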

I also managed to produce an error on HN. [screenshot]

mykeehu commented 1 year ago

Then you've dug deeper than I did. For me, HN didn't produce the bug (although, looking at the builds, the VAE-related code has changed again since then), but I'm glad you found it, if it is indeed the core of the whole problem.

mykeehu commented 1 year ago

Thanks for the quick fix; the first training session with the new code ran without errors. Thanks a lot for your help!

HoneyCodeBadger commented 1 year ago

Looks like I have a more or less similar issue, but I don't get the same error message. For me it's running out of memory. When I start playing around with the things mentioned in this thread, it gets worse and worse. After a while it's totally broken, and simply running txt2img already crashes.

[1.0, 2.0, 1.0]
Activation function is linear
Weight initialization is Normal
Layer norm is set to False
Dropout usage is set to False
Activate last layer is set to False
Optimizer name is AdamW
No saved optimizer exists in checkpoint
Training at rate of 1e-05 until step 2000
Preparing dataset...
100%|██████████████████████████████████████████████████████████████████████████████████| 52/52 [00:03<00:00, 16.92it/s]
Mean loss of 26 elements
  0%|                                                                                         | 0/2000 [00:00<?, ?it/s]
Applying cross attention optimization (Doggettx).
Error completing request
Arguments: ('JJ(b2fb0683)', '0.00001', 1, 'G:\\M\\Stable_Diff_Train_JJ\\preprocessed', 'textual_inversion', 512, 512, 2000, 10, 10, 'C:\\Users\\User\\stable-diffusion-webui\\textual_inversion_templates\\style_filewords.txt', True, 'young woman, au naturel, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, fluorescent skin, unreal engine 5 rendered, octane rendered, art style by klimt and nixeu and ian sprigger and wlop and krenz cushart', '', 20, 0, 7, -1.0, 512, 512) {}
Traceback (most recent call last):
  File "C:\Users\User\stable-diffusion-webui\modules\ui.py", line 185, in f
    res = list(func(*args, **kwargs))
  File "C:\Users\User\stable-diffusion-webui\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\modules\hypernetworks\ui.py", line 50, in train_hypernetwork
    hypernetwork, filename = modules.hypernetworks.hypernetwork.train_hypernetwork(*args)
  File "C:\Users\User\stable-diffusion-webui\modules\hypernetworks\hypernetwork.py", line 483, in train_hypernetwork
    loss.backward()
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "C:\Users\User\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\util.py", line 139, in backward
    input_grads = torch.autograd.grad(
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 276, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 8.00 GiB total capacity; 6.46 GiB already allocated; 0 bytes free; 6.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  5.06it/s]
Error completing request███████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  5.93it/s]
Arguments: ('young woman, au naturel, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, fluorescent skin, unreal engine 5 rendered, octane rendered, art style by klimt and nixeu and ian sprigger and wlop and krenz cushart', '', 'None', 'None', 20, 0, True, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.75, 0, 0, 0, 0.9, 5, '0.0001', False, 'None', '', 0.1, False, '<div class="dynamic-prompting">\n    <h3><strong>Combinations</strong></h3>\n\n    Choose a number of terms from a list, in this case we choose two artists: \n    <code class="codeblock">{2$$artist1|artist2|artist3}</code><br/>\n\n    If $$ is not provided, then 1$$ is assumed.<br/><br/>\n\n    If the chosen number of terms is greater than the available terms, then some terms will be duplicated, otherwise chosen terms will be unique. This is useful in the case of wildcards, e.g.\n    <code class="codeblock">{2$$__artist__}</code> is equivalent to <code class="codeblock">{2$$__artist__|__artist__}</code><br/><br/>\n\n    A range can be provided:\n    <code class="codeblock">{1-3$$artist1|artist2|artist3}</code><br/>\n    In this case, a random number of artists between 1 and 3 is chosen.<br/><br/>\n\n    Wildcards can be used and the joiner can also be specified:\n    <code class="codeblock">{{1-$$and$$__adjective__}}</code><br/>\n\n    Here, a random number between 1 and 3 words from adjective.txt will be chosen and joined together with the word \'and\' instead of the default comma.\n\n    <br/><br/>\n\n    <h3><strong>Wildcards</strong></h3>\n    \n\n    <br/>\n    If the groups wont drop down click <strong onclick="check_collapsibles()" style="cursor: pointer">here</strong> to fix the issue.\n\n    <br/><br/>\n\n    <code class="codeblock">WILDCARD_DIR: C:\\Users\\User\\stable-diffusion-webui\\extensions\\sd-dynamic-prompts\\wildcards</code><br/>\n    <small onload="check_collapsibles()">You can add more wildcards by creating a text file with one term per line and name is mywildcards.txt. Place it in C:\\Users\\User\\stable-diffusion-webui\\extensions\\sd-dynamic-prompts\\wildcards. <code class="codeblock">__&#60;folder&#62;/mywildcards__</code> will then become available.</small>\n</div>\n\n', True, False, 1, False, False, 100, 0.7, False, False, False, False, False, False, False, False, False, '', 1, '', 0, '', True, False, False, 1.0, 2.0, 'a painting in', 'style', 'picture frame, portrait photo', None) {}
Traceback (most recent call last):
  File "C:\Users\User\stable-diffusion-webui\modules\ui.py", line 185, in f
    res = list(func(*args, **kwargs))
  File "C:\Users\User\stable-diffusion-webui\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\modules\txt2img.py", line 48, in txt2img
    processed = process_images(p)
  File "C:\Users\User\stable-diffusion-webui\modules\processing.py", line 423, in process_images
    res = process_images_inner(p)
  File "C:\Users\User\stable-diffusion-webui\modules\processing.py", line 546, in process_images_inner
    x_sample = modules.face_restoration.restore_faces(x_sample)
  File "C:\Users\User\stable-diffusion-webui\modules\face_restoration.py", line 19, in restore_faces
    return face_restorer.restore(np_image)
  File "C:\Users\User\stable-diffusion-webui\modules\gfpgan_model.py", line 110, in restore
    return gfpgan_fix_faces(np_image)
  File "C:\Users\User\stable-diffusion-webui\modules\gfpgan_model.py", line 59, in gfpgan_fix_faces
    cropped_faces, restored_faces, gfpgan_output_bgr = model.enhance(np_image_bgr, has_aligned=False, only_center_face=False, paste_back=True)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\gfpgan\utils.py", line 145, in enhance
    restored_img = self.face_helper.paste_faces_to_input_image(upsample_img=bg_img)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\facexlib\utils\face_restoration_helper.py", line 309, in paste_faces_to_input_image
    out = self.face_parse(face_input)[0]
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\facexlib\parsing\parsenet.py", line 189, in forward
    feat = self.encoder(x)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\facexlib\parsing\parsenet.py", line 135, in forward
    res = self.conv1(x)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\facexlib\parsing\parsenet.py", line 108, in forward
    out = self.norm(out)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\facexlib\parsing\parsenet.py", line 39, in forward
    return self.norm(x)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\batchnorm.py", line 168, in forward
    return F.batch_norm(
  File "C:\Users\User\stable-diffusion-webui\venv\lib\site-packages\torch\nn\functional.py", line 2438, in batch_norm
    return torch.batch_norm(
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 8.00 GiB total capacity; 7.11 GiB already allocated; 0 bytes free; 7.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
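
Side note on the out-of-memory traces above: the max_split_size_mb hint refers to PyTorch's caching-allocator configuration, which must be in the environment before torch touches the GPU. A minimal sketch (the 128 MiB value is only an example; with the webui you would normally set PYTORCH_CUDA_ALLOC_CONF in webui-user.bat or your shell instead):

    import os

    # Must be set before the first CUDA allocation.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # imported afterwards so the allocator picks the setting up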