Panchovix closed this issue 11 months ago.
Are there any additional requirements, besides those mentioned, to install flash-attn on Windows?
I've no idea since it's only been tested on Linux, and I don't have access to a Windows machine. If you figure out how to build on Windows (or what we need to change to support Windows), please lmk.
@Panchovix are you saying we can now compile flash-attn on Windows somehow? I couldn't with the latest pull, unless I'm missing something.
Yes, now it is possible. Latest pull should work. You do need CUDA 12.x though, since CUDA 11.8 and lower don't support it.
I've uploaded a wheel here https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel
More discussion here: https://github.com/Dao-AILab/flash-attention/issues/595
Thanks, 11.8 was my error. Woohoo!
The link gives a 404 now
There are binaries here. I can't build anything beyond 2.4.2 from source myself and can't find Windows binaries beyond that anywhere. 2.4.2 works fine with current packages though.
With some untraceable magic I've built 2.5.6 on Windows 10. Compiling took ~2.5 hours.
CUDA 12.4, Torch 2.2.2+cu121, ninja 1.11.1
For anyone looking to use Flash Attention on Windows, I got it working after some tweaking. Make sure that CUDA 12.4 is installed and that PyTorch is 2.2.2+cu121. I used pip, and setup took about 2 hours to finish. Hope this helps anyone who wants to use flash-attn on Windows. BTW, I am using Windows 11 Pro; mileage may vary on Windows 10.
Have you seen significant improvements after using Flash Attention? How much?
I was able to get it working. The problem seems to be that many ML frameworks don't support Flash Attention on Windows. You would have to run tests yourself, but it seems like ctransformers does use it. Since I didn't check the performance before installing Flash Attention, I couldn't say what the improvements were.
Got it working on Windows 10 as well on Torch 2.2.2 (with CUDA 12.4 installed). Took around 15-20 min to compile on a 64-core Threadripper with Ninja, so it does scale well with compute.
Version 2.5.7 working on my Windows 10, building took around 2h:
```
pip install flash-attn --no-build-isolation
Collecting flash-attn
  Using cached flash_attn-2.5.7.tar.gz (2.5 MB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in (from flash-attn) (2.2.2+cu121)
Requirement already satisfied: einops in (from flash-attn) (0.7.0)
Requirement already satisfied: packaging in (from flash-attn) (24.0)
Requirement already satisfied: ninja in (from flash-attn) (1.11.1.1)
Requirement already satisfied: filelock in (from torch->flash-attn) (3.13.3)
Requirement already satisfied: typing-extensions>=4.8.0 in (from torch->flash-attn) (4.11.0)
Requirement already satisfied: sympy in (from torch->flash-attn) (1.12)
Requirement already satisfied: networkx in (from torch->flash-attn) (2.8.8)
Requirement already satisfied: jinja2 in (from torch->flash-attn) (3.1.3)
Requirement already satisfied: fsspec in (from torch->flash-attn) (2024.3.1)
Requirement already satisfied: MarkupSafe>=2.0 in (from jinja2->torch->flash-attn) (2.1.5)
Requirement already satisfied: mpmath>=0.19 in (from sympy->torch->flash-attn) (1.3.0)
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: filename=flash_attn-2.5.7-cp311-cp311-win_amd64.whl size=117462147
  Stored in directory: c:\users\appdata\local\pip\cache\wheels\94\a7\df\cf319d566d2bb53c7f3dd1b15ab2736cabca3e6410c75bd206
Successfully built flash-attn
Installing collected packages: flash-attn
Successfully installed flash-attn-2.5.7
```
Any luck getting it to work with cuda 11.8?
A package that needs 2 hours to install? Sorry, but that's a no-go for me. Any way to speed this up in the future? Maybe as an installer instead of a package?
Well, it doesn't take that long if you have a multi-core processor (it's the compile time). In general you're right: someone should maintain pre-built wheels, and someone usually does, but it's not consistent for Windows builds right now, so you have to search GitHub for someone who has uploaded a recent build.
The good news is FA2 is a pretty stable product right now I think, and you can grab an older wheel and it'll probably work just as well, as long as it supports the CUDA version you're using.
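When grabbing an older wheel, the filename itself tells you whether it matches your interpreter and platform (the wheel in the log above is `flash_attn-2.5.7-cp311-cp311-win_amd64.whl`). A minimal sketch of checking that before downloading; the helper names are mine, and it assumes the simple five-part filename form without build tags:

```python
import sys
from typing import NamedTuple

class WheelTags(NamedTuple):
    version: str
    python_tag: str
    platform_tag: str

def parse_wheel_name(filename: str) -> WheelTags:
    """Split a wheel filename like flash_attn-2.5.7-cp311-cp311-win_amd64.whl
    into its version, Python tag, and platform tag."""
    stem = filename[:-len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return WheelTags(version, python_tag, platform_tag)

def matches_this_interpreter(tags: WheelTags) -> bool:
    """True if the wheel's cpXY tag matches the running CPython."""
    here = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return tags.python_tag == here

tags = parse_wheel_name("flash_attn-2.5.7-cp311-cp311-win_amd64.whl")
print(tags.version, tags.python_tag, tags.platform_tag)  # → 2.5.7 cp311 win_amd64
```

A `cp311`/`win_amd64` wheel, for example, will only install on CPython 3.11 on 64-bit Windows; the CUDA and torch versions it was built against usually have to be read from wherever the wheel was published.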
I tried, but it would not compile. It might be that one of the dependencies (cutlass?) needs 12.0.
Are there more recent builds for Windows? I get the same error.
and for the 2.4.2 binaries I get this error:
ImportError: DLL load failed while importing flash_attn_2_cuda: The specified procedure could not be found."
Thanks for the quick reply. Unfortunately the same error persists with these builds. Maybe something in my PATH is missing, as I did get strange C++ build tools errors that I managed to work around but perhaps not completely fix... a prebuilt wheel is better, of course.
I have CUDA 12.4, by the way, and these say cu123... hmm.
That should be fine, technically. CUDA libraries are generally backwards compatible, as long as your torch build also has a compatible CUDA version. Does the latest pre-built wheel work? I do get the error you're getting if I use a newer package with an older flash-attn wheel, or if I build an older version of flash-attn; maybe some incompatible change on Windows was never reported. But the most recent build or wheel of flash-attn removes that error for me.
I finally found the root cause of the build failure.
https://stackoverflow.com/a/78576792/13305027
I don't understand the VS 2022 version thing, because that's what I have installed, but apparently it is related to some minor version. It wasn't entirely clear how to downgrade to another 2022 version, so perhaps installing something older than 2022 would suffice.
Alternatively, upgrade to CUDA 12.4, or preferably 12.5, it seems.
I am now testing a different approach to fix support without reinstallation.
pip calls the setup.py script, which calls PyTorch's cpp_extension.py, which builds using ninja, which calls nvcc... and passing --allow-unsupported-compiler should work around the issue.
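One way to get that flag into every nvcc invocation without touching setup.py is nvcc's NVCC_APPEND_FLAGS environment variable. A sketch of driving the build that way (the helper names are mine; the pip call is the same one used elsewhere in this thread):

```python
import os
import subprocess
import sys

def nvcc_workaround_env(base_env: dict) -> dict:
    """Return a copy of the environment with nvcc told to accept a newer
    MSVC minor version. NVCC_APPEND_FLAGS is appended to every nvcc call,
    so the flag reaches the compile steps that setup.py -> cpp_extension.py
    -> ninja eventually spawn."""
    env = dict(base_env)
    flags = env.get("NVCC_APPEND_FLAGS", "")
    env["NVCC_APPEND_FLAGS"] = (flags + " --allow-unsupported-compiler").strip()
    return env

def build_flash_attn() -> int:
    """Kick off the normal pip build with the workaround applied."""
    return subprocess.call(
        [sys.executable, "-m", "pip", "install",
         "flash-attn", "--no-build-isolation"],
        env=nvcc_workaround_env(os.environ),
    )
```

The same effect can be had from a plain cmd prompt by setting the variable before running pip.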
PS: on WSL it's pretty much hassle-free, except that there I had an error related to flash-attn, so on that other project I could simply bypass flash-attn by finding and setting use_flash_attn = False in the code. (Might be something similar for you.)
That makes sense; in the cases when I got that error I was probably linking with CUDA 12.1, and in my recent builds I had switched to 12.5. I also have a very early version of 2022 and have never updated it since it was first released.
What was the issue with WSL? It seems to work fine for me.
I had this error: https://github.com/facebookresearch/segment-anything-2/issues/100
I ended up using the same type of solution they proposed which is to bypass flash attn altogether.
Perhaps this is the case for anyone reading this thread, but it's not so helpful if you actually need flash-attn.
I'm not sure what the implications of this would be, but that repo seemed to work without it. Maybe you have some insights?
I suspect that this is caused by version differences and by how absurdly easily the import paths get messed up on Windows. Ultimately, on Windows, unless you're using Conda you really need to figure out for yourself which versions are compatible, and even then you need to know to install things in the right order.
What worked for me, unintuitive things in bold:
1) Uninstall pytorch, torchvision, xformers & torchaudio
2) Uninstall all MSVC C++ build tools
3) Uninstall all CUDA, CUDA Toolkit, cuDNN, and other NVIDIA SDKs (read: type 'nvidia', 'cudnn' and 'cuda' into the Add/Remove Programs feature and remove anything that isn't GeForce Experience or drivers)
4) Restart
5) Install MSVC C++ build tools (I have Visual Studio Community 2022, 17.11.1, the most recent one, and I also added MSVC v143 build tools for v17.9)
6) Install all the CUDA things. I went for CUDA 12.4.1 and cuDNN 9.2.1. Do NOT install this first. The CUDA Toolkit HAS to configure the MSVC setup!
7) Install pytorch (2.4.1, torchaudio 2.4.1 and torchvision 0.19)
8) Restart (yes, unlike a lot of guides that just say you have to, you actually have to. It will not work otherwise. I tried)
TL;DR: Pay super close attention to which versions are installed all over your system, and consider doing a clean re-install of CUDA stuff.
As for easing this going forward, I think a good step would be adding some sanity checks to the build process: which versions are installed, whether the include paths are sensible, and so on. As a 'crash early' mitigation, maybe we could do a quick build of some CUDA hello world before kicking off the main process? As long as the program isn't too trivial, I think it's highly likely to catch build misconfigurations.
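As a sketch of what such a crash-early check could look like (the helper functions are mine, not part of flash-attn's build): compare the CUDA version torch was built against with what the installed toolkit's nvcc reports, before committing to a multi-hour compile.

```python
import re
import subprocess

def parse_cuda_version(text: str) -> tuple:
    """Extract (major, minor) from strings like '12.1' or nvcc's
    'Cuda compilation tools, release 12.4, V12.4.131'."""
    m = re.search(r"(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError(f"no CUDA version found in {text!r}")
    return int(m.group(1)), int(m.group(2))

def versions_compatible(torch_cuda: str, toolkit_cuda: str) -> bool:
    """Same major version is the safe requirement; a newer toolkit minor
    is generally fine, since CUDA minor releases stay compatible."""
    t_major, t_minor = parse_cuda_version(torch_cuda)
    k_major, k_minor = parse_cuda_version(toolkit_cuda)
    return t_major == k_major and k_minor >= t_minor

def check_environment() -> None:
    """Fail with a readable message up front instead of mid-build."""
    import torch  # assumed already installed
    nvcc_out = subprocess.run(["nvcc", "--version"],
                              capture_output=True, text=True).stdout
    if not versions_compatible(torch.version.cuda, nvcc_out):
        raise SystemExit(
            f"torch built for CUDA {torch.version.cuda}, "
            f"but toolkit reports: {nvcc_out.strip().splitlines()[-1]}")
```

It wouldn't catch MSVC misconfiguration the way a real CUDA hello-world compile would, but it covers the torch/toolkit mismatch cases described throughout this thread.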
I followed steps 1-4 (made sure to remove all CUDA / CuDNN from Add/Remove programs - only the Geforce drivers & Geforce experience remained).
Installed the latest Microsoft Visual C++ Redistributable after step 8 to fix "OSError: [WinError 126] error loading fbgemm.dll or dependencies" (occurred when running import torch).
Installed CUDA 12.4.1
Windows 11: cuDNN 9.2 was installed from the tarball:
1) Extract the downloaded zip file to a temporary location.
2) Copy the extracted files to the CUDA toolkit directory.
3) Set up environment variables:
   - Open the Start menu and type "Environment Variables"
   - Click on "Edit the system environment variables"
   - Click the "Environment Variables" button
   - Under "System variables", find and edit the "Path" variable
   - Add the following path: C:\Program Files\NVIDIA\CUDNN\v9.2\bin
4) Verify the installation:
   - Open a new command prompt
   - Run the following command to check if cuDNN is properly installed: where cudnn*.dll
   - This should display the path to the cuDNN DLL files.
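The Path edit above is easy to get subtly wrong; a small sketch of checking it programmatically (the helper name is mine, and the example Path value is hypothetical):

```python
import os

def dir_on_path(directory: str, path_value: str, sep: str = ";") -> bool:
    """True if `directory` appears as an entry in a Windows-style Path
    value, comparing case-insensitively and ignoring trailing backslashes."""
    norm = directory.rstrip("\\").lower()
    return any(entry.rstrip("\\").lower() == norm
               for entry in path_value.split(sep) if entry)

# Against the entry the steps above add (hypothetical Path contents):
path = r"C:\Windows\system32;C:\Program Files\NVIDIA\CUDNN\v9.2\bin"
print(dir_on_path(r"C:\Program Files\NVIDIA\CUDNN\v9.2\bin", path))  # → True
```

In a live session you would pass `os.environ["PATH"]` instead of the literal string; remember a new command prompt is needed to see the updated value.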
I created a new venv and installed PyTorch 2.4 by modifying Step 7:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Finally:
pip install packaging
pip install wheel
python setup.py install

It's currently building (with a lot of warnings in the process, such as \flash_bwd_kernel.h(483): warning #177-D: variable "dtanh" was declared but never referenced).

Build completed. Created a .whl file with python setup.py bdist_wheel.
@sunsetcoder Thank you!, I've been trying to get flash_attn installed for days, these instructions are the first ones that worked.
@evilalmus You're welcome. Make sure to use Python 3.10. 3.12 no bueno
3.11.9 worked for me.
Hi there, impressive work. Tested it on Linux, and the VRAM usage and speeds at higher context are impressive (tested on exllamav2).
I've tried to do the same on Windows for exllamav2, but I hit issues when compiling or building from source.
I tried with:
- Torch 2.0.1+cu118 and CUDA 11.8
- Torch 2.2+cu121 and CUDA 12.1
- Visual Studio 2022
The errors are below, depending on whether I run python setup.py install from source or install via pip.

Compiling from source error:
```
[2/49] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output F:\ChatIAs\oobabooga\flash-attention\build\temp.win-amd64-cpython-310\Release\csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.obj.d -std=c++17 --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /wd4624 -Xcompiler /wd4067 -Xcompiler /wd4068 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IF:\ChatIAs\oobabooga\flash-attention\csrc\flash_attn -IF:\ChatIAs\oobabooga\flash-attention\csrc\flash_attn\src -IF:\ChatIAs\oobabooga\flash-attention\csrc\cutlass\include -IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include -IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include\torch\csrc\api\include -IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include\TH -IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IF:\ChatIAs\oobabooga\venv\include -IC:\Users\Pancho\AppData\Local\Programs\Python\Python310\include -IC:\Users\Pancho\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\shared" "-IC:\Program Files (x86)\Windows 
Kits\10\\include\10.0.22000.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\cppwinrt" -c F:\ChatIAs\oobabooga\flash-attention\csrc\flash_attn\src\flash_bwd_hdim160_fp16_sm80.cu -o F:\ChatIAs\oobabooga\flash-attention\build\temp.win-amd64-cpython-310\Release\csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -lineinfo -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 FAILED: F:/ChatIAs/oobabooga/flash-attention/build/temp.win-amd64-cpython-310/Release/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.obj C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output F:\ChatIAs\oobabooga\flash-attention\build\temp.win-amd64-cpython-310\Release\csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.obj.d -std=c++17 --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /wd4624 -Xcompiler /wd4067 -Xcompiler /wd4068 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IF:\ChatIAs\oobabooga\flash-attention\csrc\flash_attn -IF:\ChatIAs\oobabooga\flash-attention\csrc\flash_attn\src -IF:\ChatIAs\oobabooga\flash-attention\csrc\cutlass\include -IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include 
-IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include\torch\csrc\api\include -IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include\TH -IF:\ChatIAs\oobabooga\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IF:\ChatIAs\oobabooga\venv\include -IC:\Users\Pancho\AppData\Local\Programs\Python\Python310\include -IC:\Users\Pancho\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\cppwinrt" -c F:\ChatIAs\oobabooga\flash-attention\csrc\flash_attn\src\flash_bwd_hdim160_fp16_sm80.cu -o F:\ChatIAs\oobabooga\flash-attention\build\temp.win-amd64-cpython-310\Release\csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -lineinfo -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 flash_bwd_hdim160_fp16_sm80.cu cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__' cl : Command line warning 
D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__' cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF2_OPERATORS__' with '/U__CUDA_NO_HALF2_OPERATORS__' cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__' flash_bwd_hdim160_fp16_sm80.cu cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__' cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__' cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF2_OPERATORS__' with '/U__CUDA_NO_HALF2_OPERATORS__' cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__' flash_bwd_hdim160_fp16_sm80.cu F:/ChatIAs/oobabooga/flash-attention/csrc/cutlass/include\cute/arch/mma_sm90_desc.hpp(143): warning #226-D: invalid format string conversion printf("GmmaDescriptor: 0x%016 %lli\n", static_cast