facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

Flash attention unavailable after 0.0.21 on Windows system #863

Open KohakuBlueleaf opened 1 year ago

KohakuBlueleaf commented 1 year ago

šŸ› Bug

Command

python -m xformers.info

To Reproduce

Steps to reproduce the behavior:

Install xformers 0.0.21 (or build from source at the latest commit) on Windows; memory_efficient_attention.flshattF/B are both reported as unavailable. (Also, build.env.TORCH_CUDA_ARCH_LIST in the pre-built wheel doesn't include 8.6 or 8.9.)
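
A hedged sketch of one way to rebuild from source with the Ampere/Ada architectures included. The arch-list value and the git+ install command follow the project's published source-install instructions at the time, but treat them as assumptions rather than an official recipe:

```python
# Hedged sketch: rebuild xformers from source with 8.6/8.9 added to
# TORCH_CUDA_ARCH_LIST, which the pre-built 0.0.21 wheel does not include.
# The arch list and install command are illustrative assumptions.
import os
import subprocess
import sys

env = dict(os.environ)
env["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6;8.9"

subprocess.check_call(
    [
        sys.executable, "-m", "pip", "install", "-v", "-U",
        "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers",
    ],
    env=env,
)
```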

Expected behavior

Both the pre-built wheel and a build from source should provide Flash Attention support. (If this is because Windows lacks some feature that FlashAttention-2 needs, please at least keep FlashAttention-1 support on Windows.)

I also wondered whether this is just a bug in xformers.info, but since xformers 0.0.21 actually gives me slower results than 0.0.20, I think Flash Attention really is gone.
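
For reference, a minimal timing sketch that can be run under both 0.0.20 and 0.0.21 to compare the two versions. The shapes and iteration counts are arbitrary example values, not taken from this report:

```python
# Minimal micro-benchmark sketch: run under xformers 0.0.20 and 0.0.21 to
# compare memory_efficient_attention throughput. Shapes/iterations are
# illustrative example values only.
import time
import torch
import xformers
import xformers.ops as xops

q = torch.randn(8, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Warm-up so one-time kernel selection is not counted.
for _ in range(10):
    xops.memory_efficient_attention(q, k, v)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):
    xops.memory_efficient_attention(q, k, v)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"xformers {xformers.__version__}: {elapsed / 100 * 1e3:.3f} ms per call")
```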

Environment

Additional context

Here is the output of xformers.info on 0.0.21:

xFormers 0.0.21
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@0.0.0:         unavailable
memory_efficient_attention.flshattB@0.0.0:         unavailable
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4060 Ti
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.11.4
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.21
build.nvcc_version:                                11.8.89
source.privacy:                                    open source

Here is the output of 0.0.20:

xFormers 0.0.20
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4060 Ti
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.11.3
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.20
build.nvcc_version:                                11.8.89
source.privacy:                                    open source
rltgjqmcpgjadyd commented 1 year ago

Commit af6b866f1b1340f2b4681d1ad1c5fe96957307a9 has the same problem:

xFormers 0.0.22+af6b866.d20230926
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@0.0.0:         unavailable
memory_efficient_attention.flshattB@0.0.0:         unavailable
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
pytorch.version:                                   2.1.0.dev20230821+cu121
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4090
build.info:                                        available
build.cuda_version:                                1201
build.python_version:                              3.11.5
build.torch_version:                               2.1.0.dev20230821+cu121
build.env.TORCH_CUDA_ARCH_LIST:                    8.9
build.env.XFORMERS_BUILD_TYPE:                     None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              "-allow-unsupported-compiler"
build.env.XFORMERS_PACKAGE_FROM:                   None
build.nvcc_version:                                12.1.66
source.privacy:                                    open source
danthe3rd commented 1 year ago

Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance https://github.com/Dao-AILab/flash-attention/issues/565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.
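
For anyone hitting this, the CUTLASS fallback danthe3rd refers to can also be requested explicitly rather than relying on the dispatcher. A minimal sketch, assuming the MemoryEfficientAttentionCutlassOp tuple exported by these releases:

```python
# Minimal sketch: explicitly request the CUTLASS forward/backward pair that
# xformers falls back to on Windows, instead of letting the dispatcher choose.
# Assumes the MemoryEfficientAttentionCutlassOp tuple exported by 0.0.20/0.0.21.
import torch
import xformers.ops as xops

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionCutlassOp
)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```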

rbertus2000 commented 11 months ago

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

It seems like flash-attention 2.3.2 supports Windows now: https://github.com/Dao-AILab/flash-attention/issues/595#issuecomment-1752281403

KohakuBlueleaf commented 11 months ago

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

> It seems like flash-attention 2.3.2 supports Windows now. Dao-AILab/flash-attention#595 (comment)

I will try to build flash-attn with torch 2.1.0 and CUDA 12.1 to see if it works.

Panchovix commented 11 months ago

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

> It seems like flash-attention 2.3.2 supports Windows now. Dao-AILab/flash-attention#595 (comment)

> I will try to build flash-attn with torch 2.1.0 and CUDA 12.1 to see if it works.

Does xformers automatically use FA2 if it is installed in the venv, or do you have to build xformers with FA2 installed instead?
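
One way to answer this empirically, rather than guessing at the dispatch internals: after installing a flash-attn wheel into the same environment, check whether xformers' flash operator reports as available and whether it can be forced. A hedged sketch, assuming the is_available() classmethod and the fmha.flash.FwOp/BwOp operator classes that xformers.info itself reports from:

```python
# Hedged sketch: after installing a flash-attn wheel in the same venv, check
# whether xformers' flash operator becomes usable. Assumes the is_available()
# classmethod and the fmha.flash.FwOp/BwOp operator classes used by xformers.info.
import torch
from xformers.ops import fmha, memory_efficient_attention

print("flshattF available:", fmha.flash.FwOp.is_available())
print("flshattB available:", fmha.flash.BwOp.is_available())

q = torch.randn(1, 256, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

try:
    # Force the flash forward/backward pair; this raises if the op cannot be
    # used with this build or these inputs.
    out = memory_efficient_attention(q, k, v, op=(fmha.flash.FwOp, fmha.flash.BwOp))
    print("flash path works:", out.shape)
except Exception as exc:  # the exact exception type depends on the xformers version
    print("flash path not usable:", exc)
```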

KohakuBlueleaf commented 11 months ago

@danthe3rd Flash Attention can be compiled and installed on Windows as of 2.3.2. Will xformers be updated to support it?