Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

flash-attention imported, not running #932

Open Jacck opened 5 months ago

Jacck commented 5 months ago

I get the warning: "You are not running the flash-attention implementation, expect numerical differences." I'm just running basic inference with the Microsoft Phi-3-mini-128k-instruct model on CUDA. I have an NVIDIA GeForce RTX 2080, Driver Version: 546.12, CUDA Version: 12.3, bitsandbytes version: 0.43.1. In addition, I get the warning: "Current flash-attenton does not support window_size. Either upgrade or use attn_implementation='eager'". How can I resolve this? Thanks.
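
The second warning already suggests a workaround. Below is a minimal sketch of what that would look like, assuming the model is loaded through transformers' `AutoModelForCausalLM`; the dtype and generation call are illustrative assumptions, not taken from the original report:

```python
# Sketch only: fall back to the standard ("eager") attention implementation,
# as the warning itself suggests, instead of the flash-attention code path.
# Assumes a recent transformers release and the Phi-3 checkpoint named above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    attn_implementation="eager",  # avoid flash-attention entirely
    trust_remote_code=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```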

tridao commented 5 months ago

2080 (Turing) is not supported in the latest version.
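
For context, FlashAttention-2 requires Ampere (compute capability 8.0) or newer GPUs, and the RTX 2080 is Turing (compute capability 7.5). A hedged sketch of a runtime check that picks a backend before loading the model; the threshold and fallback choice here are assumptions, not part of the library:

```python
# Sketch only: choose an attention implementation based on the GPU's
# compute capability. FlashAttention-2 needs Ampere (sm80) or newer,
# so Turing cards like the RTX 2080 (sm75) fall back to "eager" here.
import torch

def pick_attn_implementation() -> str:
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    return "eager"

print(pick_attn_implementation())  # prints "eager" on an RTX 2080
```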

tutuandyang commented 3 months ago

I ran into the same problem while running the Mini-InternVL-4B pretrained model: I get the warning "You are not running the flash-attention implementation, expect numerical differences." This is on an A100 server. torch version: 2.1.0a0+4136153, flash-attn version: 2.3.6, transformers version: 4.41.2.
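
On an A100 the hardware is supported, so the warning usually means the model code is silently falling back rather than the kernels being unusable. A minimal sanity check, assuming the `flash_attn_func` API from this repo, to confirm the installed wheel actually runs against your torch/CUDA build (tensor shapes and dtype below are arbitrary test values):

```python
# Sketch only: verify the installed flash-attn wheel runs on this GPU/torch
# build. If this raises, the warning likely comes from a broken or mismatched
# flash-attn install rather than from the model code itself.
import torch
import flash_attn
from flash_attn import flash_attn_func

print("flash-attn version:", flash_attn.__version__)
print("torch version:", torch.__version__, "| CUDA:", torch.version.cuda)

# q, k, v have shape (batch, seqlen, nheads, headdim) and must be fp16/bf16 on CUDA.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
out = flash_attn_func(q, k, v, causal=True)
print("flash_attn_func ok, output shape:", tuple(out.shape))
```

If this check passes, the next thing to verify is that the model is actually loaded with the flash-attention backend requested (for transformers models, via `attn_implementation="flash_attention_2"`), since a silent fallback produces exactly this warning.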