hillct opened this issue 11 months ago
Updated with the Dockerfile for the test environment above.
Hi,
Have you tried setting TORCH_CUDA_ARCH_LIST=8.7? This will only build the kernels for your architecture, so it should be much faster. You can also add MAX_JOBS=8 to make sure you don't go above your RAM limit in Docker.
We don't support importing your existing flash-attention installation at the moment, mainly because the API is not stable and we could have issues there.
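The suggestion above can be sketched as a shell snippet. This is an illustrative sketch, not an official recipe: TORCH_CUDA_ARCH_LIST and MAX_JOBS are real environment variables honored by PyTorch's extension build system, and the pip command is the one from this thread; the echo is used here instead of actually running the build so the sketch stays side-effect free.

```shell
# Limit the CUDA kernel build to the Orin's sm_87 architecture
# and cap parallel compile jobs to avoid exhausting RAM in Docker.
export TORCH_CUDA_ARCH_LIST="8.7"
export MAX_JOBS=8

# The actual install command from this thread (printed, not executed,
# so this sketch has no side effects):
echo "pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers"
```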
> Hi, Have you tried setting TORCH_CUDA_ARCH_LIST=8.7? This will only build the kernels for your architecture, and it should be much faster. You can also add MAX_JOBS=8 as well to make sure you don't go above your RAM limit in Docker.
This is our current approach, and it fails when attempting to compile the submodule. I was actually able to use MAX_JOBS=12. The behavior is that the flash_attn build is not picking up TORCH_CUDA_ARCH_LIST. A wide variety of projects require integer notation (87) rather than decimal notation (8.7) for the CUDA architecture in their build process. I've never found a definitive answer as to why one notation is used over the other; presumably one is outdated, but I'm not clear on which.
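For what it's worth, the two notations mentioned above are mechanically interchangeable: PyTorch's TORCH_CUDA_ARCH_LIST uses decimal ("8.7"), while nvcc gencode flags and some build scripts use the integer form ("87" / "sm_87"). A small sketch of the conversion (the helper names here are my own, not from any library):

```python
def decimal_to_sm(arch: str) -> str:
    """Turn a PyTorch-style '8.7' into an nvcc-style 'sm_87' tag."""
    major, minor = arch.split(".")
    return f"sm_{major}{minor}"

def sm_to_decimal(sm: str) -> str:
    """Turn 'sm_87' (or a bare '87') back into the decimal '8.7' form."""
    digits = sm[3:] if sm.startswith("sm_") else sm
    return f"{digits[:-1]}.{digits[-1]}"

print(decimal_to_sm("8.7"))    # sm_87
print(sm_to_decimal("sm_87"))  # 8.7
```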
Try https://github.com/facebookresearch/xformers/issues/960, and make sure your download of the project is complete.
🐛 Bug
Even when Flash Attention and Cutlass are already installed, the xformers setup.py attempts (and fails) to build flash_attn rather than recognizing them as already installed. This failure completely prevents installation of xformers in my case. A build log is attached. This issue is observed on an Nvidia AGX Orin, which is an arm64 platform running Ubuntu 20.04 LTS.
Command
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
To Reproduce
On an arm64 system with Nvidia compute capability 8.7 (Jetson AGX Orin) running Ubuntu 20.04, Python 3.8, CUDA 11.8, and PyTorch 2.1.0, steps to reproduce the behavior:

1. MAX_JOBS=8 pip install flash-attn --no-build-isolation
   which will successfully build and install Flash Attention.
2. MAX_JOBS=8 pip install cutlass
   which will also complete successfully.
3. pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
   and observe the build failures.

First, the build process includes compilation of flash_attn and cutlass from submodules, which shouldn't be required unless they're being patched in some way. The build fails with what appear to be attempts to compile for sm_80 and sm_86 (A100 cards) but not sm_87, which is required. It should be noted that when flash_attn is built and installed on its own as in the steps above, it works fine, so the build is only broken when the subprocess of the xformers build is initiated with what appear to be hardcoded compute capabilities.
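The failure mode described above can be sketched as follows. This is purely illustrative, not xformers' actual setup.py logic; HARDCODED_FALLBACK and gencode_flags are names of my own invention, standing in for whatever the submodule build does when TORCH_CUDA_ARCH_LIST is not propagated to it.

```python
import os

# Hypothetical fallback list, matching the sm_80/sm_86 targets seen in the
# failing build log; when the env var is absent, sm_87 is silently dropped.
HARDCODED_FALLBACK = ["8.0", "8.6"]

def gencode_flags(env: dict) -> list:
    """Build nvcc -gencode flags from TORCH_CUDA_ARCH_LIST, or fall back."""
    archs = env.get("TORCH_CUDA_ARCH_LIST")
    arch_list = archs.split(";") if archs else HARDCODED_FALLBACK
    flags = []
    for a in arch_list:
        num = a.replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return flags

# Submodule build launched without the env var: only sm_80/sm_86 targets.
print(gencode_flags({}))
# Build that actually sees the env var: the sm_87 target the Orin needs.
print(gencode_flags({"TORCH_CUDA_ARCH_LIST": "8.7"}))
```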
Expected behavior
The build should complete without errors. This could be achieved by fixing the subprocess call relating to flash_attn, or by properly detecting the already-installed library.
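The second fix suggested above could take roughly this shape. A minimal sketch, assuming a standard importability check is sufficient to detect the existing install; this is not xformers' actual logic, and have_package is a hypothetical helper:

```python
import importlib.util

def have_package(name: str) -> bool:
    """Return True if a top-level package is importable in this environment."""
    return importlib.util.find_spec(name) is not None

# Skip the bundled submodule build when flash_attn is already installed.
if have_package("flash_attn"):
    print("flash_attn found: skip building the bundled copy")
else:
    print("flash_attn missing: build the submodule")
```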
Environment
Collecting environment information...
PyTorch version: 2.1.0a0+git7bcf7da
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (aarch64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.104-tegra-aarch64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Orin 32GB
Nvidia driver version: 520
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 3
Vendor ID: ARM
Model: 1
Model name: ARMv8 Processor rev 1 (v8l)
Stepping: r0p1
CPU max MHz: 2201.6001
CPU min MHz: 115.2000
BogoMIPS: 62.50
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 3 MiB
L3 cache: 6 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp uscat ilrcpc flagm

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==2.1.0a0+git7bcf7da
[conda] Could not collect
Additional context
xformers-buildlog.txt
Dockerfile for the test environment, meant for the AGX Orin arm64 SBC: https://github.com/hillct/jetson-containers-extra/blob/develop/Dockerfile