Dao-AILab / flash-attention

Fast and memory-efficient exact attention

setup.py is taking forever #1084

Open obhalerao97 opened 2 months ago

obhalerao97 commented 2 months ago

When trying to build the .so files with python3 setup.py install, the build gets stuck. I have ninja installed too. @janEbert @tridao
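One quick sanity check before assuming a hang: if the build cannot see ninja, PyTorch's extension builder falls back to a much slower compile path, which can look like a stall. A minimal sketch of verifying that ninja is actually visible to the build environment:

```bash
# Confirm ninja is on PATH and that PyTorch's extension builder can see it.
ninja --version
python3 -c "from torch.utils.cpp_extension import is_ninja_available; print(is_ninja_available())"
```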

Onwaydbh commented 2 months ago

me too

janEbert commented 2 months ago

Could it be that it's just taking a long time?

puneeshkhanna commented 2 months ago

Same here

MAX_JOBS=4 pip -v install flash-attn==2.6.3 --no-build-isolation

I used the verbose option; it gets stuck in C++ compilation indefinitely. I tried other versions, but the problem is the same.

copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
running build_ext
/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/falcon-moe/lib/python3.10/site-packages/torch/utils/cpp_extension.py:418: UserWarning: The detected CUDA version (12.2) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/falcon-moe/lib/python3.10/site-packages/torch/utils/cpp_extension.py:428: UserWarning: There are no g++ version bounds defined for CUDA version 12.2
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'flash_attn_2_cuda' extension
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/src
Emitting ninja build file /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Using envvar MAX_JOBS (4) as the number of workers...
[1/85] c++ -MMD -MF /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/flash_api.o.d -pthread -B /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include -fPIC -O2 -isystem /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include -fPIC -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn/src -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/cutlass/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/TH -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include/python3.10 -c -c /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
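A verbose log like the one above also makes it possible to tell a hang from a slow build: each finished object prints a new [k/85] line, and a new .o file lands under the temp build directory. A rough sketch for watching progress (the pip temp path contains a per-session hash, so substitute whatever your own log prints):

```bash
# Count compiled objects every minute; on a healthy build the number keeps creeping upward.
# The /tmp/pip-install-*/flash-attn_* path is session-specific -- copy it from your own log.
watch -n 60 'find /tmp/pip-install-*/flash-attn_*/build -name "*.o" | wc -l'
```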

janEbert commented 2 months ago

It could be that there's a lot of swapping going on; building takes a lot of RAM (or at least it used to, maybe that's outdated these days). Maybe you'll have better luck setting MAX_JOBS=1.
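If memory pressure is the suspicion, a quick way to check is to watch swap activity while the compile runs and then retry with parallelism capped. A sketch (the version pin is just the one used earlier in this thread):

```bash
# Check free RAM and watch for swapping (non-zero si/so columns) while the build runs.
free -h
vmstat 5 3

# Retry with a single compile job to cap peak memory use.
MAX_JOBS=1 pip install -v flash-attn==2.6.3 --no-build-isolation
```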

sergiotapia commented 2 months ago

I have a 12900K, an RTX 4090, and an SSD, and it still took about 2 hours to finish installing. It's not stuck. Check your Activity Monitor's CPU usage to see if things are still moving along.
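On a Linux build box the equivalent check is whether the compiler processes are still consuming CPU; if nvcc/cicc/c++ workers show up busy, the build is progressing, just slowly. For example:

```bash
# List processes sorted by CPU; busy nvcc/cicc/c++ workers mean the build is still running.
ps aux --sort=-%cpu | grep -E 'nvcc|cicc|c\+\+' | grep -v grep | head
```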