hillct opened this issue 11 months ago
Updated with the Dockerfile for the test environment above.
Hi,
Have you tried setting TORCH_CUDA_ARCH_LIST=8.7? This will only build the kernels for your architecture, so it should be much faster. You can also add MAX_JOBS=8 to make sure you don't go above your RAM limit in Docker.
We don't support importing your existing flash-attention installation at the moment, mainly because the API is not stable and we could have issues there.
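The suggestion above can be sketched as a shell snippet. This is an illustrative sketch, not an official recipe: TORCH_CUDA_ARCH_LIST and MAX_JOBS are real environment variables honored by PyTorch's extension build system, and the pip command is the one from this thread; the echo is used here instead of actually running the build so the sketch stays side-effect free.

```shell
# Limit the CUDA kernel build to the Orin's sm_87 architecture
# and cap parallel compile jobs to avoid exhausting RAM in Docker.
export TORCH_CUDA_ARCH_LIST="8.7"
export MAX_JOBS=8

# The actual install command from this thread (printed, not executed,
# so this sketch has no side effects):
echo "pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers"
```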
> Hi, Have you tried setting TORCH_CUDA_ARCH_LIST=8.7? This will only build the kernels for your architecture, and it should be much faster. You can also add MAX_JOBS=8 as well to make sure you don't go above your RAM limit in Docker.
This is our current approach, and it fails when attempting to compile the submodule. I was actually able to use MAX_JOBS=12. The behavior is that the flash_attn build is not picking up TORCH_CUDA_ARCH_LIST. A wide variety of projects require integer notation (87) rather than decimal notation (8.7) for the CUDA architecture in their build process. I've never found a definitive answer as to why one notation is used over the other; presumably one is outdated, but I'm not clear on which.
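For what it's worth, the two notations mentioned above are mechanically interchangeable: PyTorch's TORCH_CUDA_ARCH_LIST uses decimal ("8.7"), while nvcc gencode flags and some build scripts use the integer form ("87" / "sm_87"). A small sketch of the conversion (the helper names here are my own, not from any library):

```python
def decimal_to_sm(arch: str) -> str:
    """Turn a PyTorch-style '8.7' into an nvcc-style 'sm_87' tag."""
    major, minor = arch.split(".")
    return f"sm_{major}{minor}"

def sm_to_decimal(sm: str) -> str:
    """Turn 'sm_87' (or a bare '87') back into the decimal '8.7' form."""
    digits = sm[3:] if sm.startswith("sm_") else sm
    return f"{digits[:-1]}.{digits[-1]}"

print(decimal_to_sm("8.7"))    # sm_87
print(sm_to_decimal("sm_87"))  # 8.7
```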
Try https://github.com/facebookresearch/xformers/issues/960, and make sure your download of the project is complete.
🐛 Bug
Even when Flash Attention and Cutlass are already installed, the xformers setup.py attempts (and fails) to build flash_attn rather than recognizing them as already installed. This failure completely prevents installation of xformers in my case. A build log is attached. This issue is observed on an Nvidia AGX Orin, which is an arm64 platform running Ubuntu 20.04 LTS.
Command
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
To Reproduce
On an arm64 system with Nvidia compute capability 8.7 (Jetson AGX Orin) running Ubuntu 20.04, Python 3.8, CUDA 11.8, and PyTorch 2.1.0, steps to reproduce the behavior:

1. MAX_JOBS=8 pip install flash-attn --no-build-isolation
   which will successfully build and install Flash Attention.
2. MAX_JOBS=8 pip install cutlass
   which will also complete successfully.
3. pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
   and observe the build failures.

First, the build process includes compilation of flash_attn and cutlass from submodules, which shouldn't be required unless they're being patched in some way. The build fails with what appear to be attempts to compile for sm_80 and sm_86 (A100 cards) but not sm_87, which is required. It should be noted that when flash_attn is built and installed on its own as in the steps above, it works fine, so the build is only broken when the subprocess of the xformers build is initiated with what appear to be hardcoded compute capabilities.
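The failure mode described above can be sketched as follows. This is purely illustrative, not xformers' actual setup.py logic; HARDCODED_FALLBACK and gencode_flags are names of my own invention, standing in for whatever the submodule build does when TORCH_CUDA_ARCH_LIST is not propagated to it.

```python
import os

# Hypothetical fallback list, matching the sm_80/sm_86 targets seen in the
# failing build log; when the env var is absent, sm_87 is silently dropped.
HARDCODED_FALLBACK = ["8.0", "8.6"]

def gencode_flags(env: dict) -> list:
    """Build nvcc -gencode flags from TORCH_CUDA_ARCH_LIST, or fall back."""
    archs = env.get("TORCH_CUDA_ARCH_LIST")
    arch_list = archs.split(";") if archs else HARDCODED_FALLBACK
    flags = []
    for a in arch_list:
        num = a.replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return flags

# Submodule build launched without the env var: only sm_80/sm_86 targets.
print(gencode_flags({}))
# Build that actually sees the env var: the sm_87 target the Orin needs.
print(gencode_flags({"TORCH_CUDA_ARCH_LIST": "8.7"}))
```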
Expected behavior
The build should complete without errors. This could be achieved by fixing the subprocess call relating to flash_attn, or by properly detecting the already-installed library.
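The second fix suggested above could take roughly this shape. A minimal sketch, assuming a standard importability check is sufficient to detect the existing install; this is not xformers' actual logic, and have_package is a hypothetical helper:

```python
import importlib.util

def have_package(name: str) -> bool:
    """Return True if a top-level package is importable in this environment."""
    return importlib.util.find_spec(name) is not None

# Skip the bundled submodule build when flash_attn is already installed.
if have_package("flash_attn"):
    print("flash_attn found: skip building the bundled copy")
else:
    print("flash_attn missing: build the submodule")
```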
Environment
Collecting environment information...
PyTorch version: 2.1.0a0+git7bcf7da
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (aarch64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.104-tegra-aarch64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Orin 32GB
Nvidia driver version: 520
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 3
Vendor ID: ARM
Model: 1
Model name: ARMv8 Processor rev 1 (v8l)
Stepping: r0p1
CPU max MHz: 2201.6001
CPU min MHz: 115.2000
BogoMIPS: 62.50
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 3 MiB
L3 cache: 6 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp uscat ilrcpc flagm

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==2.1.0a0+git7bcf7da
[conda] Could not collect
Additional context
xformers-buildlog.txt
Dockerfile for the test environment, meant for the AGX Orin arm64 SBC: https://github.com/hillct/jetson-containers-extra/blob/develop/Dockerfile