PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: Flash attention cannot be used on v0.5.3 #468

Open Nero10578 opened 1 month ago

Nero10578 commented 1 month ago

Your current environment

./runtime.sh python env.py
Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35
Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
CPU family:                         6
Model:                              167
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           7007.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          384 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           4 MiB (8 instances)
L3 cache:                           16 MiB (1 instance)
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] blas                      2.16                        mkl    conda-forge
[conda] libblas                   3.8.0                    16_mkl    conda-forge
[conda] libcblas                  3.8.0                    16_mkl    conda-forge
[conda] liblapack                 3.8.0                    16_mkl    conda-forge
[conda] liblapacke                3.8.0                    16_mkl    conda-forge
[conda] mkl                       2020.2                      256
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch                   2.3.0           py3.11_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchtriton               2.3.0                     py311    pytorch
ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

I did a fresh git clone, ran ./update-runtime.sh, and then installed flash-attn with ./runtime.sh pip install flash-attn.

Aphrodite still does not use FlashAttention, even though flash-attn is already installed.
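
A quick way to narrow this down is to check whether flash_attn actually imports inside the bundled runtime, since a successful pip install does not guarantee the compiled extension loads against the runtime's torch. A minimal check could look like the sketch below (the script name check_flash.py is just an example), run as ./runtime.sh python check_flash.py:

import importlib

import torch

# Report the torch build the bundled runtime actually uses.
print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)

# Try to import flash_attn the same way the engine would; a failure here
# (for example an undefined-symbol ImportError) explains the XFormers fallback.
try:
    flash_attn = importlib.import_module("flash_attn")
    print("flash_attn:", flash_attn.__version__)
except ImportError as exc:
    print("flash_attn import failed:", exc)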

./runtime.sh python -m aphrodite.endpoints.openai.api_server \
--model /home/owen/models/Llama-3-8B-Instruct-COT-v0.1 \
--gpu-memory-utilization 0.80 --max-model-len 8192 --port 8000 --kv-cache-dtype fp8 \
--served-model-name OwenTest --enforce-eager true --max-num-seqs 160
INFO:     Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it
may cause slight accuracy drop without scaling factors. FP8_E5M2 (without scaling) is only supported on cuda version
greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for common inference criteria.
INFO:     Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO:     Model = '/home/owen/models/Llama-3-8B-Instruct-COT-v0.1'
INFO:     Speculative Config = None
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 8192
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = fp8
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO:     Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better
performance.
INFO:     Using XFormers backend.
INFO:     Model weights loaded. Memory usage: 14.96 GiB x 1 = 14.96 GiB
INFO:     # GPU blocks: 3082, # CPU blocks: 4096
INFO:     Minimum concurrency: 6.02x
INFO:     Maximum sequence length allowed in the cache: 49312
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Using the default chat template
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [11788]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
AlpinDale commented 1 month ago

Looks like installing flash-attn with our torch version doesn't work:

ImportError: /home/anon/miniconda3/envs/aphrodite/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

I'll look into it. Thanks for reporting.
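
For reference, that symbol demangles to c10::cuda::SetDevice(int), which usually indicates the flash_attn_2_cuda extension was compiled against a different torch build than the one it is loaded into. A small diagnostic sketch (not a fix) that surfaces the mismatch directly, run inside the runtime environment:

import torch

# The torch build the extension has to match.
print("runtime torch:", torch.__version__, "CUDA", torch.version.cuda)

# Importing the compiled extension directly reproduces the ABI error from the
# traceback above, without flash_attn's Python wrappers in between.
try:
    import flash_attn_2_cuda
    print("flash_attn_2_cuda loaded OK")
except ImportError as exc:
    print("ABI mismatch:", exc)

If the mismatch is confirmed, the workaround suggested by flash-attn's own install instructions is to rebuild against the already-installed torch, e.g. ./runtime.sh pip install flash-attn --no-build-isolation, so pip does not pull a second torch into an isolated build environment; whether any prebuilt wheel matches torch 2.3.0 here is a separate question.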

Ph0rk0z commented 1 month ago

I have flash attention installed and compiled from source to support the new torch, but it still says it isn't found. I'll double check it.

I recompiled it again after deleting build and dist. Sadly it doesn't work on 3 GPUs, and a 5-bit 70B won't fit on 2 despite fitting in textgen.

Nero10578 commented 1 month ago

Looks like installing flash-attn with our torch version doesn't work:

ImportError: /home/anon/miniconda3/envs/aphrodite/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

I'll look into it. Thanks for reporting.

It seems to work in the new commit now

ortegaalfredo commented 1 month ago

I can use it and it works, but it's slightly slower: 9 tok/s with it enabled vs. 11.5 tok/s with it disabled, running inference on Llama3-70B-8bpw across 4x3090 GPUs.
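
For anyone who wants to reproduce that comparison, a rough throughput probe against the OpenAI-compatible endpoint could look like the sketch below. It assumes the server from the original report (port 8000, served model name OwenTest) and that the response carries the standard OpenAI-style usage field; the prompt and token counts are arbitrary.

import json
import time
import urllib.request

URL = "http://localhost:8000/v1/completions"  # OpenAI-compatible endpoint
payload = {
    "model": "OwenTest",  # served-model-name from the command above
    "prompt": "Write a short story about a lighthouse keeper.",
    "max_tokens": 256,
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

start = time.time()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.time() - start

# The OpenAI-style response reports how many tokens were generated.
generated = body["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")

Running it once with flash-attn importable and once without should make the enabled/disabled difference measurable on the same prompt.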

Ph0rk0z commented 1 month ago

I thought vLLM supported a Triton-based FA for all (tensor) cards. I was hoping to try it here, but instead it used the normal FA package.

Nero10578 commented 1 month ago

Looks like installing flash-attn with our torch version doesn't work:

ImportError: /home/anon/miniconda3/envs/aphrodite/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

I'll look into it. Thanks for reporting.

It seems to work in the new commit now

It actually stopped working again when I tried to reinstall on the latest commit. Not sure why it worked that one time before.
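
Since it flips between working and not working across commits, it may help to record the exact torch / flash-attn pairing each time, e.g. with a small reporting snippet like this (purely a diagnostic aid):

from importlib.metadata import PackageNotFoundError, version

import torch

# Record the interpreter-visible torch build.
print("torch:", torch.__version__, "CUDA", torch.version.cuda)

# Record the installed flash-attn distribution, if any, so reports can
# name the exact version that fails to import.
try:
    print("flash-attn:", version("flash-attn"))
except PackageNotFoundError:
    print("flash-attn: not installed")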

alexanderfrey commented 3 weeks ago

Same here.