Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Build flash-attn takes a lot of time #1038

Open Sayli2000 opened 3 months ago

Sayli2000 commented 3 months ago

I'm trying to install the flash-attn package, but it takes too much time. I've made sure that ninja is installed. [screenshots of the stalled build attached]
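One thing worth ruling out first: having ninja installed with pip is not enough if PyTorch's extension builder can't find it on PATH, because the fallback build does not use multiple cores and is dramatically slower. A minimal check, assuming a working torch install, could look like this:

# Sketch: confirm PyTorch's C++/CUDA extension builder can actually find ninja.
# If ninja is missing from PATH, the build falls back to a much slower path.
from torch.utils.cpp_extension import is_ninja_available, verify_ninja_availability

print("ninja visible to torch:", is_ninja_available())
verify_ninja_availability()  # raises RuntimeError with an explanation if not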

Ph0rk0z commented 3 months ago

If I ever get out of GH jail: https://github.com/Dao-AILab/flash-attention/pull/1025#issuecomment-2207077088

tridao commented 3 months ago

Yep, it takes a long time because of all the templating.
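(For context: the kernels are instantiated in separate .cu files, roughly one per combination of head dimension, dtype, forward/backward, and causal variant, which is why the log further down shows 85 objects being compiled. A rough way to see this yourself, assuming a local clone of the repository, is:

# Rough sketch, assuming a local clone of Dao-AILab/flash-attention:
# count the per-template CUDA translation units that nvcc has to compile.
from pathlib import Path

srcs = sorted(Path("flash-attention/csrc/flash_attn/src").glob("*.cu"))
print(f"{len(srcs)} CUDA files to compile, e.g.:")
for p in srcs[:5]:
    print(" ", p.name)  # e.g. flash_fwd_hdim128_fp16_sm80.cu
)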

puneeshkhanna commented 3 months ago

Same here

MAX_JOBS=4 pip -v install flash-attn==2.6.3 --no-build-isolation

I used the verbose option; it gets stuck in the C++ compilation indefinitely. I tried other versions, but the same problem occurs.

copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
running build_ext
/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/falcon-moe/lib/python3.10/site-packages/torch/utils/cpp_extension.py:418: UserWarning: The detected CUDA version (12.2) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/falcon-moe/lib/python3.10/site-packages/torch/utils/cpp_extension.py:428: UserWarning: There are no g++ version bounds defined for CUDA version 12.2
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'flash_attn_2_cuda' extension
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/src
Emitting ninja build file /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Using envvar MAX_JOBS (4) as the number of workers...
[1/85] c++ -MMD -MF /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/flash_api.o.d -pthread -B /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include -fPIC -O2 -isystem /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include -fPIC -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn/src -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/cutlass/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/TH -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include/python3.10 -c -c /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
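The UserWarning in the log above (CUDA 12.2 toolkit vs. PyTorch built with 12.1) is only a minor-version mismatch and, as the warning itself says, is usually harmless; it does not explain the hang. To confirm which versions are actually in play, a quick check along these lines (assuming nvcc is on PATH) is:

# Sketch: compare the CUDA version PyTorch was built with against the local nvcc.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)  # e.g. 12.1
out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(next(line for line in out.stdout.splitlines() if "release" in line))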

Rahman2001 commented 1 month ago

How long does it take to finish building the wheel? Mine has been building since yesterday. I use a GeForce 930MX GPU.
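(A caveat here, offered as a likely explanation rather than an official statement: the FlashAttention-2 kernels target Ampere and newer GPUs, so even if the wheel eventually finishes building, a GeForce 930MX, a Maxwell part with compute capability 5.0, would not be able to run them. You can check what you have with something like:

# Sketch: print the GPU's compute capability; FlashAttention-2 kernels are
# built for sm80+ (Ampere/Ada/Hopper), so older parts cannot use them.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: sm_{major}{minor}")
else:
    print("No CUDA device visible to PyTorch")
)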

faaany commented 2 weeks ago

I am using the docker image "nvidia/cuda:12.1.0-devel-ubuntu22.04", and pip install flash-attn --no-build-isolation takes forever...

JohannesAck commented 1 week ago

Upgrading pip, wheel and setuptools helped me improve the compile time a lot.

python -m pip install --upgrade pip wheel setuptools

Also consider setting the number of jobs manually (64 jobs needs roughly 500 GB of RAM, so adjust accordingly).

MAX_JOBS=64 python -m pip -v install flash-attn --no-build-isolation

Without both changes it defaulted to a single compilation job for me, which took forever (I gave up after an hour).
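For picking MAX_JOBS less blindly, a rough sketch based on the "~500 GB for 64 jobs" figure above (about 8 GB per nvcc job, which is an extrapolation, not something the build system reports) could be:

# pick_max_jobs.py (name is just for illustration) -- choose MAX_JOBS from CPU
# count and available RAM, assuming each nvcc job needs roughly 8 GB.
import os

GB_PER_JOB = 8
with open("/proc/meminfo") as f:  # Linux only
    mem_avail_kb = next(int(line.split()[1]) for line in f if line.startswith("MemAvailable"))
jobs_by_ram = max(1, mem_avail_kb // (GB_PER_JOB * 1024 * 1024))
print(min(os.cpu_count() or 1, jobs_by_ram))

and then something like MAX_JOBS=$(python pick_max_jobs.py) python -m pip -v install flash-attn --no-build-isolation.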

Maybe this could be added to the ninja disclaimer in the readme, @tridao, although I guess the recommended NVIDIA container already has matching versions installed.

Antony-M1 commented 1 day ago

I tried on a Google Colab L4 machine and it's taking too much time, while on a Kaggle P100 GPU it installed within 5 seconds. I don't know what's wrong with Google Colab.
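A 5-second install usually means pip did not compile anything at all, for example because flash-attn was already present in the image or a previously built wheel was reused from the cache; on Colab it has to build from source. A rough way to tell the difference before running pip is:

# Sketch: see whether flash-attn is already installed before pip "installs" it,
# which would explain a near-instant install on Kaggle.
import importlib.metadata as md

try:
    print("flash-attn already installed, version", md.version("flash-attn"))
except md.PackageNotFoundError:
    print("flash-attn not installed; pip will have to compile it from source")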