facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

No operator found for `memory_efficient_attention_forward` with inputs: #1109

Open brcisna opened 1 week ago

brcisna commented 1 week ago

🐛 Bug

Command

Start Wunjo AI V2.

To Reproduce

Steps to reproduce the behavior:

briefcase dev  # starts Wunjo AI V2

1. Go to the generation tab.
2. Start image generation.
3. After a few seconds of image generation, the following error appears in the console:

ERROR No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 2, 1, 40) (torch.float32)
     key         : shape=(1, 2, 1, 40) (torch.float32)
     value       : shape=(1, 2, 1, 40) (torch.float32)
     attn_bias   : <class 'NoneType'>
     p           : 0.0
`ckF` is not supported because:
     dtype=torch.float32 (supported: {torch.bfloat16, torch.float16})
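For reference, the same failure reproduces outside wunjo with a direct call to the underlying op. This is a minimal sketch using the shapes and dtype from the error above; it assumes the public `xformers.ops.memory_efficient_attention` entry point and a ROCm build of PyTorch, where the AMD GPU is exposed through the `cuda` device API:

```python
import torch
import xformers.ops as xops

# Same shapes and dtype as in the error message: (batch, seq_len, heads, head_dim), float32.
q = torch.randn(1, 2, 1, 40, device="cuda", dtype=torch.float32)
k = torch.randn(1, 2, 1, 40, device="cuda", dtype=torch.float32)
v = torch.randn(1, 2, 1, 40, device="cuda", dtype=torch.float32)

# On an AMD GPU the dispatcher finds no forward operator for float32 inputs
# (the ck kernels only accept float16/bfloat16), so this call raises the
# "No operator found for memory_efficient_attention_forward" error.
out = xops.memory_efficient_attention(q, k, v, attn_bias=None, p=0.0)
```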

Expected behavior

image is created

Environment

Debian 13, Python 3.10.12, PyTorch 2.4.1+rocm6.1 (ROCm/HIPCC), and an AMD Radeon Pro W6600 GPU.

Please copy and paste the output from the environment collection script from PyTorch (or fill out the checklist below manually).

You can run the script with:

# For security purposes, please check the contents of collect_env.py before running it.
python -m torch.utils.collect_env
python -m torch.utils.collect_env
/home/superuser/.pyenv/versions/3.10.12/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.4.1+rocm6.1
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40091-a8dbc0c19

OS: Debian GNU/Linux trixie/sid (x86_64)
GCC version: (Debian 14.2.0-3) 14.2.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.40

Python version: 3.10.12 (main, Sep 17 2024, 03:58:18) [GCC 14.2.0] (64-bit runtime)
Python platform: Linux-6.10.9-amd64-x86_64-with-glibc2.40
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon Pro W6600 (gfx1032)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40091
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               40
On-line CPU(s) list:                  0-39
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
CPU family:                           6
Model:                                63
Thread(s) per core:                   2
Core(s) per socket:                   10
Socket(s):                            2
Stepping:                             2
CPU(s) scaling MHz:                   57%
CPU max MHz:                          3000.0000
CPU min MHz:                          1200.0000
BogoMIPS:                             4588.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization:                       VT-x
L1d cache:                            640 KiB (20 instances)
L1i cache:                            640 KiB (20 instances)
L2 cache:                             5 MiB (20 instances)
L3 cache:                             50 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-9,20-29
NUMA node1 CPU(s):                    10-19,30-39
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          KVM: Mitigation: VMX disabled
Vulnerability L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] onnx==1.16.2
[pip3] onnxruntime==1.19.2
[pip3] onnxruntime-gpu==1.19.2
[pip3] open_clip_torch==2.26.1
[pip3] pytorch-lightning==2.3.3
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-triton-rocm==3.0.0
[pip3] torch==2.4.1+rocm6.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.4.1+rocm6.1
[pip3] torchlibrosa==0.1.0
[pip3] torchmetrics==1.2.0
[pip3] torchvision==0.19.1+rocm6.1
[pip3] triton==3.0.0
[conda] Could not collect

Additional context

python -m xformers.info
xFormers 0.0.28.post1
memory_efficient_attention.ckF:                    available
memory_efficient_attention.ckB:                    available
memory_efficient_attention.ck_decoderF:            available
memory_efficient_attention.ck_splitKF:             available
memory_efficient_attention.cutlassF:               unavailable
memory_efficient_attention.cutlassB:               unavailable
memory_efficient_attention.fa2F@0.0.0:             unavailable
memory_efficient_attention.fa2B@0.0.0:             unavailable
memory_efficient_attention.fa3F@0.0.0:             unavailable
memory_efficient_attention.fa3B@0.0.0:             unavailable
memory_efficient_attention.triton_splitKF:         available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
sequence_parallel_fused.write_values:              available
sequence_parallel_fused.wait_values:               available
sequence_parallel_fused.cuda_memset_32b_async:     available
sp24.sparse24_sparsify_both_ways:                  available
sp24.sparse24_apply:                               available
sp24.sparse24_apply_dense_output:                  available
sp24._sparse24_gemm:                               available
sp24._cslt_sparse_mm@0.0.0:                        available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
pytorch.version:                                   2.4.1+rocm6.1
pytorch.cuda:                                      available
gpu.compute_capability:                            10.3
gpu.name:                                          AMD Radeon Pro W6600
dcgm_profiler:                                     unavailable
build.info:                                        available
build.cuda_version:                                None
build.hip_version:                                 6.1.40093-bd86f1708
build.python_version:                              3.10.15
build.torch_version:                               2.4.1+rocm6.1
build.env.TORCH_CUDA_ARCH_LIST:
build.env.PYTORCH_ROCM_ARCH:                       None
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              -allow-unsupported-compiler
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.28.post1
source.privacy:                                    open source

lw commented 5 days ago

What exactly are you asking for help with?

The error message seems quite clear: you cannot pass float32 tensors to that operator on AMD GPUs.

If you're invoking xFormers through wunjo (no idea what that is), you should check with them and get them to fix their invocation.
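For anyone hitting this downstream, here is a hedged sketch of the kind of change the caller (wunjo, not xFormers) would need: cast the attention inputs to half precision before the call and cast the output back afterwards. The tensor names are illustrative, not wunjo's actual code:

```python
import torch
import xformers.ops as xops

# Illustrative float32 inputs with the shapes from the error report.
q = torch.randn(1, 2, 1, 40, device="cuda", dtype=torch.float32)
k = torch.randn(1, 2, 1, 40, device="cuda", dtype=torch.float32)
v = torch.randn(1, 2, 1, 40, device="cuda", dtype=torch.float32)

# The ck kernels on ROCm only support float16/bfloat16, so cast the inputs
# down, run the attention, then cast the result back to the caller's dtype.
out = xops.memory_efficient_attention(q.half(), k.half(), v.half())
out = out.to(torch.float32)
```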