Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off #1081

Open · sysuls1 opened this issue 2 months ago

sysuls1 commented 2 months ago

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

Running gemma-2-27b with flash attention enabled (-fa) prints the warning in the title, and flash attention is forced off:

CUDA_VISIBLE_DEVICES=0 ./llama-server --host 0.0.0.0 --port 8008 -m /home/kemove/model/gemma-2-27b-it-Q5_K_S.gguf -ngl 99 -t 4 -np 4 -ns 4 -c 512 -fa
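
For context, Gemma-2 applies a tanh soft-cap to the attention logits before the softmax, and llama.cpp's fused flash-attention path did not implement that step, so it forces -fa off rather than compute the model incorrectly. Below is a minimal PyTorch sketch of the reference (non-fused) computation, purely illustrative and not llama.cpp code; the cap value of 50.0 is an assumption matching Gemma-2's attn_logit_softcapping as I understand it.

```python
import torch

def softcapped_attention(q, k, v, cap=50.0):
    # q, k, v: (batch, heads, seqlen, headdim); causal mask omitted for brevity.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    # The soft-cap squashes the logits into (-cap, cap) before the softmax.
    # A fused attention kernel that skips this step cannot reproduce Gemma-2
    # exactly, which is why llama.cpp disables -fa for this model.
    scores = cap * torch.tanh(scores / cap)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 16, 64)
out = softcapped_attention(q, k, v)
```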

tridao commented 2 months ago

cool

sysuls1 commented 2 months ago

How should I resolve this so that flash_attn can actually be used?
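
For what it's worth, at the level of this repo's Python API I believe soft-cap support was added in the 2.6 releases. A hypothetical usage sketch, assuming a `softcap` keyword on flash_attn_func (the shapes and the softcap=50.0 value, meant to mirror Gemma-2's attn_logit_softcapping, are my assumptions):

```python
import torch
from flash_attn import flash_attn_func

b, s, h, d = 1, 512, 32, 128
q = torch.randn(b, s, h, d, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Soft-capped, causal flash attention; output has the same (b, s, h, d) layout.
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)
```

On the llama.cpp side, though, a fix would presumably have to land in its own attention kernels, since llama-server does not use this Python package.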

transcendReality commented 2 months ago

How is no one addressing this? LMStudio doesn't work for me at all anymore. It's bricked.

tridao commented 2 months ago

Feel free to work on it if you need it. We welcome contributions.