AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR after completed generation #8097

Open KataiKi opened 1 year ago

KataiKi commented 1 year ago

Is there an existing issue for this?

What happened?

I'm unsure if this is a bug, a configuration issue, or a hardware limitation.

I am trying to run generations at higher resolutions (1024x1024). Generations at 512x512 work fine, as do generations at 1024x512. When I start a generation at 1024x1024, it seems to run just fine. When the generation completes, however, the entire service crashes with the following error:

File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 634, in <listcomp>
    x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

With CUDA_LAUNCH_BLOCKING=1

  File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
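
(For reference, CUDA_LAUNCH_BLOCKING has to be in the environment before CUDA is initialized; a minimal standalone Python sketch of the idea, hypothetical and not part of the web UI code, looks like this.)

import os
# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call; otherwise kernel
# launches stay asynchronous and errors surface at some later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
# Any GPU work after this point runs synchronously, so a failing kernel
# produces a stack trace that points at the real call site.
x = torch.randn(1, 3, 1024, 1024, device="cuda")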

A second attempt ends with this message:

File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\cuda\memory.py", line 125, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: the launch timed out and was terminated

The service stops responding altogether and needs to be closed and restarted.

The previews look like everything is working fine. I get to see the finished product for a split second before the service crashes. Nothing gets saved.

The GPU I'm using is pretty modest: a GTX 970 with 4 GB of VRAM. I'm using a lot of options to reduce VRAM usage, but this doesn't look like the usual error for running out of VRAM.

Google searches seem to point to it being a CUDA/Torch issue: https://github.com/pytorch/pytorch/issues/27588
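
As a quick sanity check that this isn't simply the card running out of memory, a few standard torch.cuda calls (nothing web-UI specific, just a sketch run in the same venv; torch.cuda.mem_get_info needs a reasonably recent torch) report free versus total VRAM:

import torch

# Report how much VRAM is actually free on the card; the 970 here has 4 GB total.
props = torch.cuda.get_device_properties(0)
free, total = torch.cuda.mem_get_info()
print(f"{props.name}: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")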

Steps to reproduce the problem

  1. Start Server
  2. Enter prompt
  3. Click Generate
  4. Observe: The picture generates fine at 512x512
  5. Move size sliders to 1024x1024
  6. Optional: Move Sampling Steps to 1 (this saves time)
  7. Generate
  8. Wait for image to generate. Preview should show progress
  9. At 100%, the image panel is blank.
  10. Observe error in console logs.

What should have happened?

The process should've completed and the image saved properly to disk.

Commit where the problem happens

0cc0ee1b

What platforms do you use to access the UI ?

Windows

What browsers do you use to access the UI ?

Mozilla Firefox, Google Chrome

Command Line Arguments

@echo off

set PYTHON=
set GIT=
set VENV_DIR=

set COMMANDLINE_ARGS=--listen --port=80 --theme=dark --xformers --lowvram --hide-ui-dir-config --freeze-settings --gradio-auth-path=users.auth --opt-sub-quad-attention --use-cpu=interrogate

call webui.bat

List of extensions

No

Console logs

Arguments: ('task(oaredjnhwyzbjwl)', '', '', ['Men of the Mountain'], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 1024, 1024, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
  File "C:\Projects\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "C:\Projects\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "C:\Projects\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 486, in process_images
    res = process_images_inner(p)
  File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 634, in process_images_inner
    x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
  File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 634, in <listcomp>
    x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

With CUDA_LAUNCH_BLOCKING=1.

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [05:01<00:00, 15.09s/it]
Error completing request███████████████████████████████████████████████████████████████| 20/20 [04:45<00:00, 15.03s/it]
Arguments: ('task(spv2u55w6j6s7zc)', '', '', ['Portrait of a Girl'], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 900, 1600, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
  File "C:\Projects\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "C:\Projects\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "C:\Projects\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 486, in process_images
    res = process_images_inner(p)
  File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 634, in process_images_inner
    x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
  File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 634, in <listcomp>
    x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
  File "C:\Projects\stable-diffusion-webui\modules\processing.py", line 423, in decode_first_stage
    x = model.decode_first_stage(x)
  File "C:\Projects\stable-diffusion-webui\modules\sd_hijack_utils.py", line 17, in <lambda>
    setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
  File "C:\Projects\stable-diffusion-webui\modules\sd_hijack_utils.py", line 28, in __call__
    return self.__orig_func(*args, **kwargs)
  File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Projects\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 826, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "C:\Projects\stable-diffusion-webui\modules\lowvram.py", line 52, in first_stage_model_decode_wrap
    return first_stage_model_decode(z)
  File "C:\Projects\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\autoencoder.py", line 90, in decode
    dec = self.decoder(z)
  File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Projects\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\model.py", line 631, in forward
    h = self.mid.attn_1(h)
  File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Projects\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\model.py", line 267, in forward
    out = self.proj_out(out)
  File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Projects\stable-diffusion-webui\extensions-builtin\Lora\lora.py", line 182, in lora_Conv2d_forward
    return lora_forward(self, input, torch.nn.Conv2d_forward_before_lora(self, input))
  File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Projects\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Additional information

GTX 970 4GB

Skymaster3 commented 1 year ago

Same problem

KataiKi commented 1 year ago

After some research, it looks like this may be something that can be solved on the OS side.

https://forums.developer.nvidia.com/t/cuda-the-launch-timed-out-and-was-terminated/20337/7

Launch timeouts normally occur because the kernel is taking too long to run on a GPU which has an active display. The driver will kill kernels taking more than a few seconds to complete. The reason why commenting out that line allows the kernel to complete without the timeout is because, without the global memory writes, most of the kernel code will be removed by compiler optimisation, leaving you with an empty kernel.

The solution is to reduce the kernel execution time, either by doing less work per kernel call or improving the code efficiency, or some combination of both. The other alternative is to use a dedicated compute card, which eliminates the display driver time limit altogether.

Some additional resources on similar matters:

https://stackoverflow.com/questions/497685/cuda-apps-time-out-fail-after-several-seconds-how-to-work-around-this

The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.

KataiKi commented 1 year ago

Possible Solution/Workaround.

I seem to have been able to generate the picture all the way to the end. It looks like if any single call to the graphics card takes longer than about 2 seconds, Windows kills the process to prevent the computer from seizing up due to graphical issues. You can disable this behavior in the following manner:

  1. Install Visual Studio (https://visualstudio.microsoft.com/). You need the full IDE; Visual Studio Code isn't enough.
  2. Install CUDA Toolkit 11.7 (https://developer.nvidia.com/cuda-11-7-0-download-archive)
  3. Start NSight Monitor
  4. Click on the tray icon for NSight Monitor to open the window.
  5. Click on NSight Monitor Options in the bottom right.
  6. Under General, set WDDM TDR Enabled to False.
  7. Restart the computer.

This should disable the timeout and allow for the larger images to process.

Ideally, we should ensure that a single pass doesn't extend past 2 seconds, but I don't know if that's feasible at this time. This is a fine workaround for the time being, though Nvidia seems to warn that it may cause some system instability. While we may consider this issue resolved, I think we should document it somewhere.
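
For anyone who wants to verify the change without opening Nsight: the setting lives in the registry under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers (TdrLevel, plus TdrDelay for the timeout length). A small read-only Python sketch to check the current values (the values may simply be absent, in which case Windows defaults apply):

import winreg

# Read-only check of the Windows TDR (Timeout Detection and Recovery) settings.
# TdrLevel = 0 disables detection; TdrDelay is the timeout in seconds.
# If a value is missing, Windows falls back to its defaults (TDR enabled, ~2 s).
KEY_PATH = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    for name in ("TdrLevel", "TdrDelay"):
        try:
            value, _ = winreg.QueryValueEx(key, name)
            print(f"{name} = {value}")
        except FileNotFoundError:
            print(f"{name} not set (Windows default applies)")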

KataiKi commented 1 year ago

Further testing shows there are additional issues beyond this error. The image is generated and saved to disk; however, the UI fails to update, and the completed image is not presented to the user. The following errors show up in the browser's console:

Uncaught (in promise) TypeError: L[jt[Gt]] is undefined

Uncaught (in promise) SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data

I'm guessing the JSON data that gets handed to Gradio is malformed, possibly containing a null value. My best guess is that, due to the delay in CUDA processing (>2 seconds per action), SD is pulling the finished data too early and handing Gradio bad data to update the UI with.

The UI hangs after this error, requiring a refresh. The service remains functional, however. This is likely unrelated to the original issue.

Skymaster3 commented 1 year ago

Possible Solution/Workaround. [...] This should disable the timeout and allow for the larger images to process.

This really fixes the error "cuDNN error: CUDNN_STATUS_MAPPING_ERROR", thank you.

Now I have another error:

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [03:20<00:00, 10.04s/it]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [13:49<00:00, 41.45s/it]
Error completing request███████████████████████████████████████████████████████████████| 40/40 [17:38<00:00, 41.45s/it]
Arguments: ('task(s32gkgq2fuaso0b)', '((masterpiece,best quality)),1girl, solo, animal ears, rabbit, barefoot, knees up, dress, sitting, rabbit ears, short sleeves, looking at viewer, grass, short hair, smile, white hair, puffy sleeves, outdoors, puffy short sleeves, bangs, on ground, full body, animal, white dress, sunlight, brown eyes, dappled sunlight, day, depth of field', 'EasyNegative, extra fingers,fewer fingers,', [], 20, 15, False, False, 1, 1, 10, 2337269170.0, -1.0, 0, 0, 0, False, 832, 512, True, 0.6, 1.8, 'Latent', 0, 0, 0, [], 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
  File "E:\NovelAi\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "E:\NovelAi\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "E:\NovelAi\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 486, in process_images
    res = process_images_inner(p)
  File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 634, in process_images_inner
    x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
  File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 634, in <listcomp>
    x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
  File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 423, in decode_first_stage
    x = model.decode_first_stage(x)
  File "E:\NovelAi\stable-diffusion-webui\modules\sd_hijack_utils.py", line 17, in <lambda>
    setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
  File "E:\NovelAi\stable-diffusion-webui\modules\sd_hijack_utils.py", line 28, in __call__
    return self.__orig_func(*args, **kwargs)
  File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 826, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "E:\NovelAi\stable-diffusion-webui\modules\lowvram.py", line 52, in first_stage_model_decode_wrap
    return first_stage_model_decode(z)
  File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\autoencoder.py", line 90, in decode
    dec = self.decoder(z)
  File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\model.py", line 637, in forward
    h = self.up[i_level].block[i_block](h, temb)
  File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\model.py", line 131, in forward
    h = self.norm1(h)
  File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\normalization.py", line 273, in forward
    return F.group_norm(
  File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\functional.py", line 2528, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 674.00 MiB (GPU 0; 2.00 GiB total capacity; 1.49 GiB already allocated; 0 bytes free; 1.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The image generation itself completes without problems (20/20) and I get the result. After that, the "hires. fix" pass also runs to completion (20/20) but ends with the above error. I still don't understand why it tries to allocate so much memory with my parameters. My parameters:

set COMMANDLINE_ARGS=--lowvram --opt-sub-quad-attention --opt-channelslast --always-batch-cond-uncond
set CUDA_LAUNCH_BLOCKING=1
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

CUDA version: 11.7
Torch version: 1.13.1+cu117
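
If it helps narrow down where the 674 MiB allocation comes from, the allocator state can be dumped with a couple of standard torch.cuda calls right before and after the hires-fix decode (a rough sketch; the tags are just labels):

import torch

def report(tag: str) -> None:
    # Compare what PyTorch has actually handed out vs. what it keeps reserved.
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    total = torch.cuda.get_device_properties(0).total_memory / 2**20
    print(f"[{tag}] allocated={alloc:.0f} MiB  reserved={reserved:.0f} MiB  total={total:.0f} MiB")

report("before hires decode")
# ... run the generation / hires. fix pass here ...
report("after hires decode")
print(torch.cuda.memory_summary(abbreviated=True))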

rkfg commented 1 year ago

The same thing happens randomly on Linux with Torch 2.0.1+cu118. The probability of the crash seems higher when generating a batch with hires fix enabled. There's also a line in dmesg that reads:

GPU error detected: NVRM: Xid (PCI:0000:01:00): 31, pid=1682203, name=python, Ch 00000028, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x7f5a_d4851000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ.

As far as I remember this never happened with Torch 1.13, possibly because it didn't use cuDNN. I'm not sure about that, but 1.13 was noticeably slower than the 2.0.1 I use now; I got about a 30-50% speed boost after updating.
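
A quick way to confirm whether cuDNN is actually in use, and to rule it out by disabling it for a test run, is the standard torch.backends.cudnn flags; sketch below, nothing webui-specific:

import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
print("cuDNN enabled:", torch.backends.cudnn.enabled)

# For a test run, cuDNN can be turned off entirely; convolutions fall back to
# slower non-cuDNN kernels, which helps tell a cuDNN bug from a driver fault.
torch.backends.cudnn.enabled = False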