KataiKi opened this issue 1 year ago
Same problem
After some research, this may be something that can be solved on the OS side.
https://forums.developer.nvidia.com/t/cuda-the-launch-timed-out-and-was-terminated/20337/7
Launch timeouts normally occur because the kernel is taking too long to run on a GPU which has an active display. The driver will kill kernels taking more than a few seconds to complete. The reason why commenting out that line allows the kernel to complete without the timeout is because without the global memory writes, most of the kernel code will be removed by compiler optimisation, leaving you with an empty kernel.
The solution is to reduce the kernel execution time, either by doing less work per kernel call or improving the code efficiency, or some combination of both. The other alternative is to use a dedicated compute card, which eliminates the display driver time limit altogether.
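To illustrate the "do less work per kernel call" part, here is a minimal PyTorch sketch (not code from the webui) that processes a large tensor in smaller slices and synchronizes between slices, so no single burst of queued GPU work runs long on a display GPU. The sizes and slice count are arbitrary example values.

# Minimal sketch: split one large GPU operation into smaller submissions so
# each one finishes quickly under an active TDR watchdog.
import torch

big = torch.randn(32, 3, 512, 512, device="cuda")   # arbitrary example workload

def process(chunk):
    # stand-in for whatever heavy per-element work the real kernel does
    return torch.nn.functional.avg_pool2d(chunk, kernel_size=2)

results = []
for chunk in torch.split(big, 8, dim=0):             # 8 images per submission instead of 32
    results.append(process(chunk))
    torch.cuda.synchronize()                         # finish this slice before queuing more work

out = torch.cat(results, dim=0)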
Some additional resources on similar matters:
The easiest way to disable TDR for CUDA programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
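For anyone curious what that Nsight option actually touches: the TDR settings live in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers (TdrLevel, TdrDelay). Here is a small read-only Python sketch to inspect them on Windows; the values may simply be absent if the driver is running on defaults.

# Read the current TDR registry values on Windows (read-only check).
# TdrLevel = 0 disables timeout detection; TdrDelay is the timeout in seconds
# (the default is 2). Missing values mean the driver default applies.
import winreg

KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY) as key:
    for name in ("TdrLevel", "TdrDelay"):
        try:
            value, _ = winreg.QueryValueEx(key, name)
            print(f"{name} = {value}")
        except FileNotFoundError:
            print(f"{name} not set (driver default applies)")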
Further testing shows that there are additional issues beyond this error. The image is generated and saved to disk, but the UI fails to update and the completed image is not presented to the user. The following errors show up in the browser's console:
Uncaught (in promise) TypeError: L[jt[Gt]] is undefined
Uncaught (in promise) SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
I'm guessing the JSON data that gets handed to Gradio is malformed, possibly a null value. My best guess is that, due to the delay of the CUDA processing (>2 seconds per action), SD is pulling the finished data too early and handing Gradio bad data for the UI update.
The UI hangs after this error, requiring a refresh. The service remains functional, however. This is likely unrelated to the original issue.
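As a quick sanity check of the malformed-JSON guess above, the same class of failure is easy to reproduce locally: parsing an empty body or an HTML error page instead of real JSON fails at "line 1 column 1", which matches what the browser console reports.

# Reproduce the "unexpected character at line 1 column 1" failure mode by
# parsing an empty body or an HTML error page instead of real JSON.
import json

for body in ("", "<html>Internal Server Error</html>"):
    try:
        json.loads(body)
    except json.JSONDecodeError as err:
        print(f"{body!r}: {err}")
# '': Expecting value: line 1 column 1 (char 0)
# '<html>Internal Server Error</html>': Expecting value: line 1 column 1 (char 0)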
Possible Solution/Workaround.
I seem to have been able to generate the picture all the way to the end. It looks like if any call to the graphics card takes longer than 2 seconds, Windows will kill the process to prevent the computer from seizing up due to graphical issues. You can disable it in the following manner:
- Install Visual Studio (https://visualstudio.microsoft.com/). You need the full IDE; Visual Studio Code isn't enough.
- Install CUDA Toolkit 11.7 (https://developer.nvidia.com/cuda-11-7-0-download-archive)
- Start Nsight Monitor
- Click on the tray icon for Nsight Monitor to open up the window.
- Click on Nsight Monitor Options in the bottom right
- Under General, set WDDM TDR Enabled to False
- Restart the computer
This should disable the timeout and allow for the larger images to process.
Ideally, we should ensure that a single pass doesn't extend past 2 seconds, but I don't know if that's feasible at this time. This is a fine workaround for the time being, though Nvidia seems to warn that it may cause some system instability. While we may consider this issue resolved, I think we should document it somewhere.
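If someone wants to measure how close any single GPU-bound step gets to the 2-second window, a rough way is to time it with CUDA events. This helper is purely illustrative (it does not exist in the webui), and the vae/latents names in the usage comment are hypothetical.

# Illustrative helper: time one GPU-bound call with CUDA events to see how
# long a single submission keeps the device busy.
import torch

def time_gpu_call(fn, *args, **kwargs):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()                  # wait for the queued work to finish
    print(f"GPU time: {start.elapsed_time(end) / 1000.0:.2f} s")
    return result

# Hypothetical usage: decoded = time_gpu_call(vae.decode, latents)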
This really fixes the error "cuDNN error: CUDNN_STATUS_MAPPING_ERROR", thank you.
Now I have another error:
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [03:20<00:00, 10.04s/it]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [13:49<00:00, 41.45s/it]
Error completing request███████████████████████████████████████████████████████████████| 40/40 [17:38<00:00, 41.45s/it]
Arguments: ('task(s32gkgq2fuaso0b)', '((masterpiece,best quality)),1girl, solo, animal ears, rabbit, barefoot, knees up, dress, sitting, rabbit ears, short sleeves, looking at viewer, grass, short hair, smile, white hair, puffy sleeves, outdoors, puffy short sleeves, bangs, on ground, full body, animal, white dress, sunlight, brown eyes, dappled sunlight, day, depth of field', 'EasyNegative, extra fingers,fewer fingers,', [], 20, 15, False, False, 1, 1, 10, 2337269170.0, -1.0, 0, 0, 0, False, 832, 512, True, 0.6, 1.8, 'Latent', 0, 0, 0, [], 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
File "E:\NovelAi\stable-diffusion-webui\modules\call_queue.py", line 56, in f
res = list(func(*args, **kwargs))
File "E:\NovelAi\stable-diffusion-webui\modules\call_queue.py", line 37, in f
res = func(*args, **kwargs)
File "E:\NovelAi\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
processed = process_images(p)
File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 486, in process_images
res = process_images_inner(p)
File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 634, in process_images_inner
x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 634, in <listcomp>
x_samples_ddim = [decode_first_stage(p.sd_model, samples_ddim[i:i+1].to(dtype=devices.dtype_vae))[0].cpu() for i in range(samples_ddim.size(0))]
File "E:\NovelAi\stable-diffusion-webui\modules\processing.py", line 423, in decode_first_stage
x = model.decode_first_stage(x)
File "E:\NovelAi\stable-diffusion-webui\modules\sd_hijack_utils.py", line 17, in <lambda>
setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
File "E:\NovelAi\stable-diffusion-webui\modules\sd_hijack_utils.py", line 28, in __call__
return self.__orig_func(*args, **kwargs)
File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 826, in decode_first_stage
return self.first_stage_model.decode(z)
File "E:\NovelAi\stable-diffusion-webui\modules\lowvram.py", line 52, in first_stage_model_decode_wrap
return first_stage_model_decode(z)
File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\autoencoder.py", line 90, in decode
dec = self.decoder(z)
File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\model.py", line 637, in forward
h = self.up[i_level].block[i_block](h, temb)
File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "E:\NovelAi\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\model.py", line 131, in forward
h = self.norm1(h)
File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\normalization.py", line 273, in forward
return F.group_norm(
File "E:\NovelAi\stable-diffusion-webui\venv\lib\site-packages\torch\nn\functional.py", line 2528, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 674.00 MiB (GPU 0; 2.00 GiB total capacity; 1.49 GiB already allocated; 0 bytes free; 1.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The image generation itself goes without problems (20/20) and I get the result. After that, the "hires. fix" pass also completes (20/20), but ends with the above error. I still don't understand why it tries to allocate so much memory with my parameters. My parameters:
set COMMANDLINE_ARGS=--lowvram --opt-sub-quad-attention --opt-channelslast --always-batch-cond-uncond
set CUDA_LAUNCH_BLOCKING=1
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128
CUDA version: 11.7 Torch version: 1.13.1+cu117
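For what it's worth, PYTORCH_CUDA_ALLOC_CONF has to be set in the environment before the first CUDA allocation, and the allocator statistics can show whether the failure is really fragmentation (reserved far above allocated). A small sketch using the same allocator settings as above:

# Set the allocator config before torch touches CUDA, then print allocator
# statistics to compare allocated vs. merely reserved memory.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.6,max_split_size_mb:128"

import torch

x = torch.randn(1024, 1024, device="cuda")    # any GPU workload goes here
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
print(torch.cuda.memory_summary(abbreviated=True))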
The same thing happens randomly on Linux with Torch 2.0.1+cu118. The probability of the crash looks higher when generating a batch with hires fix enabled. There's also a line in dmesg that reads:
GPU error detected: NVRM: Xid (PCI:0000:01:00): 31, pid=1682203, name=python, Ch 00000028, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x7f5a_d4851000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
As far as I remember, it never happened with Torch 1.13, probably because it didn't use cuDNN. I'm not sure about that, but 1.13 was noticeably slower than the 2.0.1 I use now; I got about a 30-50% speed boost after updating.
Is there an existing issue for this?
What happened?
I'm unsure if this is a bug, a configuration issue, or a hardware limitation.
I am trying to run a generation at higher resolutions (1024x1024). Generations at 512x512 work fine, as do generations at 1024x512. When I start a generation at 1024x1024, it seems to run just fine. When the generation is complete, however, the entire service crashes with the following error.
With CUDA_LAUNCH_BLOCKING=1
A second attempt ends up with this message
The service stops responding altogether and needs to be closed and restarted.
The previews look like everything is working fine. I get to see the finished product for a split second before the service crashes. Nothing gets saved.
The GPU I'm using is pretty slim, a GTX 970 4GB. I'm using a lot of options to reduce the load on the VRAM, but this doesn't look like the usual error for running into VRAM issues.
Google searches seem to point to it being a CUDA/Torch issue: https://github.com/pytorch/pytorch/issues/27588
Steps to reproduce the problem
What should have happened?
The process should've completed and the image saved properly to disk.
Commit where the problem happens
0cc0ee1b
What platforms do you use to access the UI ?
Windows
What browsers do you use to access the UI ?
Mozilla Firefox, Google Chrome
Command Line Arguments
List of extensions
No
Console logs
With CUDA_LAUNCH_BLOCKING=1.
Additional information
GTX 970 4GB