facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

Flash attention unavailable after 0.0.21 on Windows system #863

Open KohakuBlueleaf opened 1 year ago

KohakuBlueleaf commented 1 year ago

šŸ› Bug

Command

python -m xformers.info

To Reproduce

Steps to reproduce the behavior:

Install xformers 0.0.21 (or build from source at the latest commit) on Windows; memory_efficient_attention.flshattF/B are both reported as unavailable. (Also, build.env.TORCH_CUDA_ARCH_LIST in the pre-built wheel doesn't include 8.6 or 8.9.)
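
A hedged sketch of one way to rebuild from source with the Ampere/Ada architectures included. The arch-list value and the git+ install command follow the project's published source-install instructions at the time, but treat them as assumptions rather than an official recipe:

```python
# Hedged sketch: rebuild xformers from source with 8.6/8.9 added to
# TORCH_CUDA_ARCH_LIST, which the pre-built 0.0.21 wheel does not include.
# The arch list and install command are illustrative assumptions.
import os
import subprocess
import sys

env = dict(os.environ)
env["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6;8.9"

subprocess.check_call(
    [
        sys.executable, "-m", "pip", "install", "-v", "-U",
        "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers",
    ],
    env=env,
)
```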

Expected behavior

Both the pre-built wheel and a build from source should provide Flash Attention support. (If this is because Windows lacks some feature that FlashAttention-2 needs, please at least keep FlashAttention-1 support on Windows.)

I also wondered whether this is just a bug in xformers.info, but since xformers 0.0.21 actually gives me slower results than 0.0.20, I think Flash Attention really is gone.
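
For reference, a minimal timing sketch that can be run under both 0.0.20 and 0.0.21 to compare the two versions. The shapes and iteration counts are arbitrary example values, not taken from this report:

```python
# Minimal micro-benchmark sketch: run under xformers 0.0.20 and 0.0.21 to
# compare memory_efficient_attention throughput. Shapes/iterations are
# illustrative example values only.
import time
import torch
import xformers
import xformers.ops as xops

q = torch.randn(8, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Warm-up so one-time kernel selection is not counted.
for _ in range(10):
    xops.memory_efficient_attention(q, k, v)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):
    xops.memory_efficient_attention(q, k, v)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"xformers {xformers.__version__}: {elapsed / 100 * 1e3:.3f} ms per call")
```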

Environment

Additional context

Here is the output of xformers.info on 0.0.21:

xFormers 0.0.21
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@0.0.0:         unavailable
memory_efficient_attention.flshattB@0.0.0:         unavailable
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4060 Ti
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.11.4
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.21
build.nvcc_version:                                11.8.89
source.privacy:                                    open source

Here is the output of 0.0.20:

xFormers 0.0.20
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4060 Ti
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.11.3
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.20
build.nvcc_version:                                11.8.89
source.privacy:                                    open source
rltgjqmcpgjadyd commented 1 year ago

Commit af6b866f1b1340f2b4681d1ad1c5fe96957307a9 has the same problem:

xFormers 0.0.22+af6b866.d20230926
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@0.0.0:         unavailable
memory_efficient_attention.flshattB@0.0.0:         unavailable
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
pytorch.version:                                   2.1.0.dev20230821+cu121
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4090
build.info:                                        available
build.cuda_version:                                1201
build.python_version:                              3.11.5
build.torch_version:                               2.1.0.dev20230821+cu121
build.env.TORCH_CUDA_ARCH_LIST:                    8.9
build.env.XFORMERS_BUILD_TYPE:                     None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              "-allow-unsupported-compiler"
build.env.XFORMERS_PACKAGE_FROM:                   None
build.nvcc_version:                                12.1.66
source.privacy:                                    open source
danthe3rd commented 1 year ago

Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance https://github.com/Dao-AILab/flash-attention/issues/565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.
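
For anyone hitting this, the CUTLASS fallback danthe3rd refers to can also be requested explicitly rather than relying on the dispatcher. A minimal sketch, assuming the MemoryEfficientAttentionCutlassOp tuple exported by these releases:

```python
# Minimal sketch: explicitly request the CUTLASS forward/backward pair that
# xformers falls back to on Windows, instead of letting the dispatcher choose.
# Assumes the MemoryEfficientAttentionCutlassOp tuple exported by 0.0.20/0.0.21.
import torch
import xformers.ops as xops

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionCutlassOp
)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```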

rbertus2000 commented 11 months ago

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

It seems like flash-attention 2.3.2 supports Windows now: https://github.com/Dao-AILab/flash-attention/issues/595#issuecomment-1752281403

KohakuBlueleaf commented 11 months ago

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

> It seems like flash-attention 2.3.2 supports Windows now. Dao-AILab/flash-attention#595 (comment)

I will try to build flash-attn with torch 2.1.0 and CUDA 12.1 to see if it works.

Panchovix commented 11 months ago

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

> It seems like flash-attention 2.3.2 supports Windows now. Dao-AILab/flash-attention#595 (comment)

> I will try to build flash-attn with torch 2.1.0 and CUDA 12.1 to see if it works.

Does xformers automatically use FA2 if it is installed in the venv, or do you have to build xformers with FA2 installed instead?
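
One way to answer this empirically, rather than guessing at the dispatch internals: after installing a flash-attn wheel into the same environment, check whether xformers' flash operator reports as available and whether it can be forced. A hedged sketch, assuming the is_available() classmethod and the fmha.flash.FwOp/BwOp operator classes that xformers.info itself reports from:

```python
# Hedged sketch: after installing a flash-attn wheel in the same venv, check
# whether xformers' flash operator becomes usable. Assumes the is_available()
# classmethod and the fmha.flash.FwOp/BwOp operator classes used by xformers.info.
import torch
from xformers.ops import fmha, memory_efficient_attention

print("flshattF available:", fmha.flash.FwOp.is_available())
print("flshattB available:", fmha.flash.BwOp.is_available())

q = torch.randn(1, 256, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

try:
    # Force the flash forward/backward pair; this raises if the op cannot be
    # used with this build or these inputs.
    out = memory_efficient_attention(q, k, v, op=(fmha.flash.FwOp, fmha.flash.BwOp))
    print("flash path works:", out.shape)
except Exception as exc:  # the exact exception type depends on the xformers version
    print("flash path not usable:", exc)
```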

KohakuBlueleaf commented 11 months ago

@danthe3rd Flash Attention can be compiled and installed on Windows as of 2.3.2. Will xformers be updated to support it?