[Bug]: cutlassF: no kernel found to launch

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits

What happened?

I installed stable-diffusion-webui on a Debian 11 server with 256 GB of RAM and one NVIDIA K80 with 24 GB of VRAM. When I launch with "--xformers", an error is thrown at the end of image generation: cutlassF: no kernel found to launch. Without "--xformers", it works fine.
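
A possible cause worth checking: the K80 is a Kepler-generation card with CUDA compute capability 3.7, and the prebuilt xformers attention kernels may not be compiled for hardware that old. The sketch below is a quick diagnostic, not a confirmed root cause. It reads the capability through torch.cuda.get_device_capability and compares it with a minimum of 5.0. That threshold is an assumption and does not come from the logs in this report, and ASSUMED_MIN_CAPABILITY, xformers_likely_supported, and detect_capability are hypothetical helper names.

```python
from typing import Optional, Tuple

# Assumption: the prebuilt xformers CUTLASS attention kernels target compute
# capability 5.0 and newer. This threshold is not taken from the logs above.
ASSUMED_MIN_CAPABILITY: Tuple[int, int] = (5, 0)


def xformers_likely_supported(
    capability: Tuple[int, int],
    minimum: Tuple[int, int] = ASSUMED_MIN_CAPABILITY,
) -> bool:
    """Return True when the (major, minor) capability meets the assumed minimum."""
    return tuple(capability) >= tuple(minimum)


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the first CUDA device's (major, minor), or None if unavailable."""
    try:
        import torch  # imported lazily so the helper above works without torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible to torch")
    else:
        supported = xformers_likely_supported(cap)
        print(f"compute capability {cap[0]}.{cap[1]}; "
              f"xformers likely supported: {supported}")
```

Run it inside the webui venv so that it sees the same torch build. If it reports a capability below the assumed minimum, launching without "--xformers" (which already works here) is the practical workaround.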

Steps to reproduce the problem

Install by following the default instructions.
Edit webui-user.sh and add: export COMMANDLINE_ARGS="--xformers --listen --enable-insecure-extension-access"
Open 192.168.1.120:7860, go to txt2img, and enter the prompt: a girl with a dog. When generation is almost finished, the error is thrown.

What should have happened?

Image generation should complete without an error.

Version or Commit where the problem happens

version: v1.5.1 • python: 3.10.6 • torch: 2.0.1+cu118 • xformers: 0.0.20 • gradio: 3.32.0 • checkpoint: fc2511737a

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Linux

What device are you running WebUI on?

Nvidia GPUs (RTX 20 above)

Cross attention optimization

xformers

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

export COMMANDLINE_ARGS="--xformers --listen --enable-insecure-extension-access"

List of extensions

Console logs

(venv) stable@dell730:~/stable-diffusion-webui$ source webui.sh

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on stable user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
python venv already activate or run without venv: /home/stable/stable-diffusion-webui/venv
################################################################

################################################################
Launching launch.py...
################################################################
Cannot locate TCMalloc (improves CPU memory usage)
Python 3.10.6 (main, Aug 22 2023, 16:43:57) [GCC 10.2.1 20210110]
Version: v1.5.1
Commit hash: 68f336bd994bed5442ad95bad6b6ad5564a5409a

Launching Web UI with arguments: --xformers --listen --enable-insecure-extension-access
dirname:  /home/stable/stable-diffusion-webui/localizations
localizations:  {}
2023-08-23 19:31:21,247 - ControlNet - INFO - ControlNet v1.1.306
ControlNet preprocessor location: /home/stable/stable-diffusion-webui/extensions/sd-webui-controlnet/annotator/downloads
2023-08-23 19:31:21,385 - ControlNet - INFO - ControlNet v1.1.306
Loading weights [fc2511737a] from /home/stable/stable-diffusion-webui/models/Stable-diffusion/chilloutmix_NiPrunedFp32Fix.safetensors
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 38.5s (launcher: 13.2s, import torch: 7.8s, import gradio: 2.8s, setup paths: 4.8s, other imports: 4.6s, setup codeformer: 0.4s, load scripts: 1.5s, create ui: 1.5s, gradio launch: 1.8s).
Creating model from config: /home/stable/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying attention optimization: Doggettx... done.
Model loaded in 40.6s (load weights from disk: 6.0s, create model: 1.2s, apply weights to model: 24.5s, apply half(): 0.4s, load VAE: 2.2s, move model to device: 5.2s, calculate empty prompt: 1.0s).
100%|██████████| 20/20 [00:49<00:00,  2.48s/it]
██████████| 20/20 [00:45<00:00,  2.42s/it]
*** Error completing request
*** Arguments: ('task(mrzmc679w5venze)', 'a girl with a dog', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, 0, '', '', [], <gradio.routes.Request object at 0x7f9b5e7ac070>, 0, <scripts.controlnet_ui.controlnet_ui_group.UiControlNetUnit object at 0x7f9b5fecf1f0>, False, False, 'positive', 'comma', 0, False, False, '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0, None, None, False, 50) {}
    Traceback (most recent call last):
      File "/home/stable/stable-diffusion-webui/modules/call_queue.py", line 58, in f
        res = list(func(*args, **kwargs))
      File "/home/stable/stable-diffusion-webui/modules/call_queue.py", line 37, in f
        res = func(*args, **kwargs)
      File "/home/stable/stable-diffusion-webui/modules/txt2img.py", line 62, in txt2img
        processed = processing.process_images(p)
      File "/home/stable/stable-diffusion-webui/modules/processing.py", line 677, in process_images
        res = process_images_inner(p)
      File "/home/stable/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/batch_hijack.py", line 42, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
      File "/home/stable/stable-diffusion-webui/modules/processing.py", line 796, in process_images_inner
        x_samples_ddim = decode_latent_batch(p.sd_model, samples_ddim, target_device=devices.cpu, check_for_nans=True)
      File "/home/stable/stable-diffusion-webui/modules/processing.py", line 545, in decode_latent_batch
        sample = decode_first_stage(model, batch[i:i + 1])[0]
      File "/home/stable/stable-diffusion-webui/modules/processing.py", line 576, in decode_first_stage
        x = model.decode_first_stage(x.to(devices.dtype_vae))
      File "/home/stable/stable-diffusion-webui/modules/sd_hijack_utils.py", line 17, in <lambda>
        setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
      File "/home/stable/stable-diffusion-webui/modules/sd_hijack_utils.py", line 28, in __call__
        return self.__orig_func(*args, **kwargs)
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "/home/stable/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 826, in decode_first_stage
        return self.first_stage_model.decode(z)
      File "/home/stable/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/autoencoder.py", line 90, in decode
        dec = self.decoder(z)
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/stable/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/model.py", line 631, in forward
        h = self.mid.attn_1(h)
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/stable/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/model.py", line 258, in forward
        out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op)
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 192, in memory_efficient_attention
        return _memory_efficient_attention(
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 290, in _memory_efficient_attention
        return _memory_efficient_attention_forward(
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 310, in _memory_efficient_attention_forward
        out, *_ = op.apply(inp, needs_gradient=False)
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/xformers/ops/fmha/cutlass.py", line 175, in apply
        out, lse, rng_seed, rng_offset = cls.OPERATOR(
      File "/home/stable/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
        return self._op(*args, **kwargs or {})
    RuntimeError: cutlassF: no kernel found to launch!

---

Additional information

No response

AUTOMATIC1111 / stable-diffusion-webui, issue #12740