Open obhalerao97 opened 2 months ago
me too
Could it be that it's just taking a long time?
Same here
MAX_JOBS=4 pip -v install flash-attn==2.6.3 --no-build-isolation
I used the verbose option; it gets stuck in the C++ compilation indefinitely. I tried other versions, but I hit the same problem.
copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-310/flash_attn/ops/triton
running build_ext
/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/falcon-moe/lib/python3.10/site-packages/torch/utils/cpp_extension.py:418: UserWarning: The detected CUDA version (12.2) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/falcon-moe/lib/python3.10/site-packages/torch/utils/cpp_extension.py:428: UserWarning: There are no g++ version bounds defined for CUDA version 12.2
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'flash_attn_2_cuda' extension
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn
creating /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/src
Emitting ninja build file /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Using envvar MAX_JOBS (4) as the number of workers...
[1/85] c++ -MMD -MF /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/flash_api.o.d -pthread -B /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include -fPIC -O2 -isystem /lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include -fPIC -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn/src -I/tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/cutlass/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/TH -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/lustre1/tier2/users/puneesh.khanna/miniconda3/envs/venv/include/python3.10 -c -c /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-14eos5qz/flash-attn_021be3b5eaac41e793324f2128cf5d4c/build/temp.linux-x86_64-cpython-310/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
It could be that there's a lot of swapping going on; building takes a lot of RAM (or at least it used to, maybe that's outdated these days). You may have better luck setting `MAX_JOBS=1`.
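A rough way to pick a `MAX_JOBS` value on Linux is to size it against available memory. The ~4 GB-per-compile-job figure below is an assumption, not an official number; actual peak usage depends on the flash-attn version and the GPU architectures being compiled for:

```shell
# Sketch: derive a conservative MAX_JOBS from MemAvailable (Linux only).
# Assumption: each nvcc compile job may need roughly 4 GB of RAM at peak.
avail_gb=$(awk '/MemAvailable/ {print int($2/1024/1024)}' /proc/meminfo)
jobs=$(( avail_gb / 4 ))
if [ "$jobs" -lt 1 ]; then jobs=1; fi
echo "Suggested: MAX_JOBS=$jobs pip install flash-attn --no-build-isolation"
```

If the suggested value comes out at 1, the build will be slow but should avoid swapping.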
I have a 12900K, an RTX 4090, and an SSD, and it still took about 2 hours to finish installing. It's not stuck; check your CPU usage in Activity Monitor to see if things are still moving along.
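On Linux, one quick way to tell progress from a hang is to watch for compiler worker processes. The process names below (`nvcc`, `cicc`, `cc1plus`) are typical for a CUDA extension build but may differ with your toolchain; workers near 100% CPU mean compilation is still running, while near-zero CPU suggests swapping or a stall:

```shell
# Sketch: list the top CPU consumers; during the build you should see
# nvcc/cicc/cc1plus (and ninja) near the top with high %CPU.
ps -eo pcpu,etime,comm --sort=-pcpu | head -n 8
```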
When trying to build the .so files by doing `python3 setup.py install`, it's getting stuck. I have ninja installed too. @janEbert @tridao