Linaqruf / kohya-trainer

Adapted from https://note.com/kohya_ss/n/nbf7ce8d80f29 for easier cloning
Apache License 2.0

Error training LoRA. NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs... #222

Open · iskewedI opened this issue 1 year ago

iskewedI commented 1 year ago

Hi. I'm trying to run the Dreambooth LoRA training in a Kaggle notebook with a P100 GPU accelerator. I'm getting this error while executing the "5.5. Start Training" code cell. I tried changing the xFormers version (0.0.16, 0.0.17, 0.0.18, 0.0.19...), but nothing solves the issue. Torch 2 is currently being used; I don't know if there's a way to change that without breaking everything, so I haven't tested with Torch 1.13.
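
In case it matters, I was swapping versions with a plain pip pin in a notebook cell, roughly like this (0.0.17 here is just one of the versions I tried, not a recommendation):

!pip install -q --force-reinstall xformers==0.0.17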

This is the output of !python -m xformers.info:

xFormers 0.0.18
memory_efficient_attention.cutlassF:               unavailable
memory_efficient_attention.cutlassB:               unavailable
memory_efficient_attention.flshattF:               unavailable
memory_efficient_attention.flshattB:               unavailable
memory_efficient_attention.smallkF:                unavailable
memory_efficient_attention.smallkB:                unavailable
memory_efficient_attention.tritonflashattF:        available
memory_efficient_attention.tritonflashattB:        available
indexing.scaled_index_addF:                        unavailable
indexing.scaled_index_addB:                        unavailable
indexing.index_select:                             unavailable
swiglu.dual_gemm_silu:                             unavailable
swiglu.gemm_fused_operand_sum:                     unavailable
swiglu.fused.p.cpp:                                not built
is_triton_available:                               True
is_functorch_available:                            False
pytorch.version:                                   2.0.0
pytorch.cuda:                                      available
gpu.compute_capability:                            6.0
gpu.name:                                          Tesla P100-PCIE-16GB
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.10.10
build.torch_version:                               2.0.0+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.18
source.privacy:                                    open source
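
The failure also reproduces outside the trainer with a direct call (a minimal sketch assuming a CUDA device; the tensor shapes and dtype are copied from the failing step in the traceback below):

import torch
import xformers.ops

# the Tesla P100 reports compute capability (6, 0)
print(torch.cuda.get_device_capability())

# same shapes/dtype as the failing attention call: (B, M, H, K)
q = torch.randn(1, 4096, 8, 40, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# raises the same NotImplementedError: no forward operator is available,
# since this wheel's CUDA kernels weren't built and triton needs an A100
out = xformers.ops.memory_efficient_attention(q, k, v)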

The actual error:

/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:98: UserWarning: unable to load libtensorflow_io_plugins.so: unable to open file: libtensorflow_io_plugins.so, from paths: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
  warnings.warn(f"unable to load libtensorflow_io_plugins.so: {e}")
/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:104: UserWarning: file system plugins are not loaded: unable to open file: libtensorflow_io.so, from paths: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
  warnings.warn(f"file system plugins are not loaded: {e}")
/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:98: UserWarning: unable to load libtensorflow_io_plugins.so: unable to open file: libtensorflow_io_plugins.so, from paths: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
  warnings.warn(f"unable to load libtensorflow_io_plugins.so: {e}")
/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:104: UserWarning: file system plugins are not loaded: unable to open file: libtensorflow_io.so, from paths: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
  warnings.warn(f"file system plugins are not loaded: {e}")
Loading settings from /kaggle/working/LoRA/config/config_file.toml...
/kaggle/working/LoRA/config/config_file
prepare tokenizer
Downloading (…)olve/main/vocab.json: 100%|███| 961k/961k [00:00<00:00, 2.96MB/s]
Downloading (…)olve/main/merges.txt: 100%|███| 525k/525k [00:00<00:00, 2.14MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████| 389/389 [00:00<00:00, 332kB/s]
Downloading (…)okenizer_config.json: 100%|██████| 905/905 [00:00<00:00, 778kB/s]
update token length: 225
Load dataset config from /kaggle/working/LoRA/config/dataset_config.toml
prepare images.
found directory /kaggle/input/data-img/upscaled_v2_prepr contains 92 image files
920 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: False

  [Subset 0 of Dataset 0]
    image_dir: "/kaggle/input/data-img/upscaled_v2_prepr"
    image_count: 92
    num_repeats: 10
    shuffle_caption: True
    keep_tokens: 0
    caption_dropout_rate: 0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: mksks
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████| 92/92 [00:00<00:00, 118.18it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 920
mean ar error (without repeats): 0.0
prepare accelerator
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
Downloading (…)lve/main/config.json: 100%|█| 4.52k/4.52k [00:00<00:00, 3.35MB/s]
Downloading pytorch_model.bin: 100%|███████| 1.71G/1.71G [00:30<00:00, 55.3MB/s]
loading text encoder: <All keys matched successfully>
load VAE: /kaggle/working/vae/stablediffusion.vae.pt
additional VAE loaded
Replace CrossAttention.forward to use xformers
[Dataset 0]
caching latents.
100%|███████████████████████████████████████████| 23/23 [00:18<00:00,  1.22it/s]
import network module: networks.lora
create LoRA network. base dim (rank): 32, alpha: 16
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
use 8-bit AdamW optimizer | {}
override steps. steps for 5 epochs is / 指定エポックまでのステップ数: 4600
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 920
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 920
  num epochs / epoch数: 5
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 4600
steps:   0%|                                           | 0/4600 [00:00<?, ?it/s]epoch 1/5
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /kaggle/working/kohya-trainer/train_network.py:752 in <module>               │
│                                                                              │
│   749 │   args = parser.parse_args()                                         │
│   750 │   args = train_util.read_config_from_file(args, parser)              │
│   751 │                                                                      │
│ ❱ 752 │   train(args)                                                        │
│   753                                                                        │
│                                                                              │
│ /kaggle/working/kohya-trainer/train_network.py:583 in train                  │
│                                                                              │
│   580 │   │   │   │                                                          │
│   581 │   │   │   │   # Predict the noise residual                           │
│   582 │   │   │   │   with accelerator.autocast():                           │
│ ❱ 583 │   │   │   │   │   noise_pred = unet(noisy_latents, timesteps, encode │
│   584 │   │   │   │                                                          │
│   585 │   │   │   │   if args.v_parameterization:                            │
│   586 │   │   │   │   │   # v-parameterization training                      │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in   │
│ _call_impl                                                                   │
│                                                                              │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or s │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hoo │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                      │
│   1502 │   │   # Do not call functions when jit is used                      │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []         │
│   1504 │   │   backward_pre_hooks = []                                       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:490   │
│ in __call__                                                                  │
│                                                                              │
│   487 │   │   update_wrapper(self, model_forward)                            │
│   488 │                                                                      │
│   489 │   def __call__(self, *args, **kwargs):                               │
│ ❱ 490 │   │   return convert_to_fp32(self.model_forward(*args, **kwargs))    │
│   491 │                                                                      │
│   492 │   def __getstate__(self):                                            │
│   493 │   │   raise pickle.PicklingError(                                    │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:14 in     │
│ decorate_autocast                                                            │
│                                                                              │
│    11 │   @functools.wraps(func)                                             │
│    12 │   def decorate_autocast(*args, **kwargs):                            │
│    13 │   │   with autocast_instance:                                        │
│ ❱  14 │   │   │   return func(*args, **kwargs)                               │
│    15 │   decorate_autocast.__script_unsupported = '@autocast() decorator is │
│    16 │   return decorate_autocast                                           │
│    17                                                                        │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.p │
│ y:381 in forward                                                             │
│                                                                              │
│   378 │   │   down_block_res_samples = (sample,)                             │
│   379 │   │   for downsample_block in self.down_blocks:                      │
│   380 │   │   │   if hasattr(downsample_block, "has_cross_attention") and do │
│ ❱ 381 │   │   │   │   sample, res_samples = downsample_block(                │
│   382 │   │   │   │   │   hidden_states=sample,                              │
│   383 │   │   │   │   │   temb=emb,                                          │
│   384 │   │   │   │   │   encoder_hidden_states=encoder_hidden_states,       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in   │
│ _call_impl                                                                   │
│                                                                              │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or s │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hoo │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                      │
│   1502 │   │   # Do not call functions when jit is used                      │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []         │
│   1504 │   │   backward_pre_hooks = []                                       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py:6 │
│ 07 in forward                                                                │
│                                                                              │
│    604 │   │   │   │   │   return custom_forward                             │
│    605 │   │   │   │                                                         │
│    606 │   │   │   │   hidden_states = torch.utils.checkpoint.checkpoint(cre │
│ ❱  607 │   │   │   │   hidden_states = torch.utils.checkpoint.checkpoint(    │
│    608 │   │   │   │   │   create_custom_forward(attn, return_dict=False), h │
│    609 │   │   │   │   )[0]                                                  │
│    610 │   │   │   else:                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:249 in     │
│ checkpoint                                                                   │
│                                                                              │
│   246 │   │   raise ValueError("Unexpected keyword arguments: " + ",".join(a │
│   247 │                                                                      │
│   248 │   if use_reentrant:                                                  │
│ ❱ 249 │   │   return CheckpointFunction.apply(function, preserve, *args)     │
│   250 │   else:                                                              │
│   251 │   │   return _checkpoint_without_reentrant(                          │
│   252 │   │   │   function,                                                  │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/autograd/function.py:506 in    │
│ apply                                                                        │
│                                                                              │
│   503 │   │   if not torch._C._are_functorch_transforms_active():            │
│   504 │   │   │   # See NOTE: [functorch vjp and autograd interaction]       │
│   505 │   │   │   args = _functorch.utils.unwrap_dead_wrappers(args)         │
│ ❱ 506 │   │   │   return super().apply(*args, **kwargs)  # type: ignore[misc │
│   507 │   │                                                                  │
│   508 │   │   if cls.setup_context == _SingleLevelFunction.setup_context:    │
│   509 │   │   │   raise RuntimeError(                                        │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:107 in     │
│ forward                                                                      │
│                                                                              │
│   104 │   │   ctx.save_for_backward(*tensor_inputs)                          │
│   105 │   │                                                                  │
│   106 │   │   with torch.no_grad():                                          │
│ ❱ 107 │   │   │   outputs = run_function(*args)                              │
│   108 │   │   return outputs                                                 │
│   109 │                                                                      │
│   110 │   @staticmethod                                                      │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py:6 │
│ 00 in custom_forward                                                         │
│                                                                              │
│    597 │   │   │   │   def create_custom_forward(module, return_dict=None):  │
│    598 │   │   │   │   │   def custom_forward(*inputs):                      │
│    599 │   │   │   │   │   │   if return_dict is not None:                   │
│ ❱  600 │   │   │   │   │   │   │   return module(*inputs, return_dict=return │
│    601 │   │   │   │   │   │   else:                                         │
│    602 │   │   │   │   │   │   │   return module(*inputs)                    │
│    603                                                                       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in   │
│ _call_impl                                                                   │
│                                                                              │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or s │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hoo │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                      │
│   1502 │   │   # Do not call functions when jit is used                      │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []         │
│   1504 │   │   backward_pre_hooks = []                                       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/diffusers/models/attention.py:216 in │
│ forward                                                                      │
│                                                                              │
│   213 │   │                                                                  │
│   214 │   │   # 2. Blocks                                                    │
│   215 │   │   for block in self.transformer_blocks:                          │
│ ❱ 216 │   │   │   hidden_states = block(hidden_states, context=encoder_hidde │
│   217 │   │                                                                  │
│   218 │   │   # 3. Output                                                    │
│   219 │   │   if self.is_input_continuous:                                   │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in   │
│ _call_impl                                                                   │
│                                                                              │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or s │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hoo │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                      │
│   1502 │   │   # Do not call functions when jit is used                      │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []         │
│   1504 │   │   backward_pre_hooks = []                                       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/diffusers/models/attention.py:484 in │
│ forward                                                                      │
│                                                                              │
│   481 │   │   if self.only_cross_attention:                                  │
│   482 │   │   │   hidden_states = self.attn1(norm_hidden_states, context) +  │
│   483 │   │   else:                                                          │
│ ❱ 484 │   │   │   hidden_states = self.attn1(norm_hidden_states) + hidden_st │
│   485 │   │                                                                  │
│   486 │   │   if self.attn2 is not None:                                     │
│   487 │   │   │   # 2. Cross-Attention                                       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in   │
│ _call_impl                                                                   │
│                                                                              │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or s │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hoo │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                      │
│   1502 │   │   # Do not call functions when jit is used                      │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []         │
│   1504 │   │   backward_pre_hooks = []                                       │
│                                                                              │
│ /kaggle/working/kohya-trainer/library/train_util.py:1792 in forward_xformers │
│                                                                              │
│   1789 │   │   q = q.contiguous()                                            │
│   1790 │   │   k = k.contiguous()                                            │
│   1791 │   │   v = v.contiguous()                                            │
│ ❱ 1792 │   │   out = xformers.ops.memory_efficient_attention(q, k, v, attn_b │
│   1793 │   │                                                                 │
│   1794 │   │   out = rearrange(out, "b n h d -> b n (h d)", h=h)             │
│   1795                                                                       │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:196 in │
│ memory_efficient_attention                                                   │
│                                                                              │
│   193 │   │   and options.                                                   │
│   194 │   :return: multi-head attention Tensor with shape ``[B, Mq, H, Kv]`` │
│   195 │   """                                                                │
│ ❱ 196 │   return _memory_efficient_attention(                                │
│   197 │   │   Inputs(                                                        │
│   198 │   │   │   query=query, key=key, value=value, p=p, attn_bias=attn_bia │
│   199 │   │   ),                                                             │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:294 in │
│ _memory_efficient_attention                                                  │
│                                                                              │
│   291 ) -> torch.Tensor:                                                     │
│   292 │   # fast-path that doesn't require computing the logsumexp for backw │
│   293 │   if all(x.requires_grad is False for x in [inp.query, inp.key, inp. │
│ ❱ 294 │   │   return _memory_efficient_attention_forward(                    │
│   295 │   │   │   inp, op=op[0] if op is not None else None                  │
│   296 │   │   )                                                              │
│   297                                                                        │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:310 in │
│ _memory_efficient_attention_forward                                          │
│                                                                              │
│   307 │   inp.validate_inputs()                                              │
│   308 │   output_shape = inp.normalize_bmhk()                                │
│   309 │   if op is None:                                                     │
│ ❱ 310 │   │   op = _dispatch_fw(inp)                                         │
│   311 │   else:                                                              │
│   312 │   │   _ensure_op_supports_or_raise(ValueError, "memory_efficient_att │
│   313                                                                        │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:98 in  │
│ _dispatch_fw                                                                 │
│                                                                              │
│    95 │   if _is_triton_fwd_fastest(inp):                                    │
│    96 │   │   priority_list_ops.remove(triton.FwOp)                          │
│    97 │   │   priority_list_ops.insert(0, triton.FwOp)                       │
│ ❱  98 │   return _run_priority_list(                                         │
│    99 │   │   "memory_efficient_attention_forward", priority_list_ops, inp   │
│   100 │   )                                                                  │
│   101                                                                        │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:73 in  │
│ _run_priority_list                                                           │
│                                                                              │
│    70 {textwrap.indent(_format_inputs_description(inp), '     ')}"""         │
│    71 │   for op, not_supported in zip(priority_list, not_supported_reasons) │
│    72 │   │   msg += "\n" + _format_not_supported_reasons(op, not_supported) │
│ ❱  73 │   raise NotImplementedError(msg)                                     │
│    74                                                                        │
│    75                                                                        │
│    76 def _dispatch_fw(inp: Inputs) -> Type[AttentionFwOpBase]:              │
╰──────────────────────────────────────────────────────────────────────────────╯
NotImplementedError: No operator found for `memory_efficient_attention_forward` 
with inputs:
     query       : shape=(1, 4096, 8, 40) (torch.float16)
     key         : shape=(1, 4096, 8, 40) (torch.float16)
     value       : shape=(1, 4096, 8, 40) (torch.float16)
     attn_bias   : <class 'NoneType'>
     p           : 0.0
`cutlassF` is not supported because:
    xFormers wasn't build with CUDA support
    Operator wasn't built - see `python -m xformers.info` for more info
`flshattF` is not supported because:
    xFormers wasn't build with CUDA support
    Operator wasn't built - see `python -m xformers.info` for more info
    requires a GPU with compute capability > 7.5
`tritonflashattF` is not supported because:
    xFormers wasn't build with CUDA support
    requires A100 GPU
`smallkF` is not supported because:
    xFormers wasn't build with CUDA support
    dtype=torch.float16 (supported: {torch.float32})
    max(query.shape[-1] != value.shape[-1]) > 32
    Operator wasn't built - see `python -m xformers.info` for more info
    unsupported embed per head: 40
steps:   0%|                                           | 0/4600 [00:01<?, ?it/s]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/bin/accelerate:8 in <module>                                      │
│                                                                              │
│   5 from accelerate.commands.accelerate_cli import main                      │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│ ❱ 8 │   sys.exit(main())                                                     │
│   9                                                                          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.p │
│ y:45 in main                                                                 │
│                                                                              │
│   42 │   │   exit(1)                                                         │
│   43 │                                                                       │
│   44 │   # Run                                                               │
│ ❱ 45 │   args.func(args)                                                     │
│   46                                                                         │
│   47                                                                         │
│   48 if __name__ == "__main__":                                              │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py:1104   │
│ in launch_command                                                            │
│                                                                              │
│   1101 │   elif defaults is not None and defaults.compute_environment == Com │
│   1102 │   │   sagemaker_launcher(defaults, args)                            │
│   1103 │   else:                                                             │
│ ❱ 1104 │   │   simple_launcher(args)                                         │
│   1105                                                                       │
│   1106                                                                       │
│   1107 def main():                                                           │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py:567 in │
│ simple_launcher                                                              │
│                                                                              │
│    564 │   process = subprocess.Popen(cmd, env=current_env)                  │
│    565 │   process.wait()                                                    │
│    566 │   if process.returncode != 0:                                       │
│ ❱  567 │   │   raise subprocess.CalledProcessError(returncode=process.return │
│    568                                                                       │
│    569                                                                       │
│    570 def multi_gpu_launcher(args):                                         │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/opt/conda/bin/python3.10', 'train_network.py', 
'--sample_prompts=/kaggle/working/LoRA/config/sample_prompt.txt', 
'--dataset_config=/kaggle/working/LoRA/config/dataset_config.toml', 
'--config_file=/kaggle/working/LoRA/config/config_file.toml']' returned non-zero
exit status 1.
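
Since the environment already has PyTorch 2.0, I'm wondering whether its built-in fused attention could serve as a fallback on this GPU instead of the xFormers operators the error message rules out above. A rough sketch of the equivalent call (this is just PyTorch's own API, not the trainer's actual code path):

import torch
import torch.nn.functional as F

# note the layout: PyTorch's SDPA expects (B, H, M, K), whereas the
# xFormers call in the traceback uses (B, M, H, K), so the trainer
# would need to transpose the head axis
q = torch.randn(1, 8, 4096, 40, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# always has a math fallback, so it runs even on compute capability 6.0
out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0)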

Any help is greatly appreciated! Thank you.