CompVis / stable-diffusion

A latent text-to-image diffusion model
https://ommer-lab.com/research/latent-diffusion-models/

UserWarning: 1Torch was not compiled with flash attention. #850

Open stromyu520 opened 1 month ago

stromyu520 commented 1 month ago

Loading pipeline components...: 100%|██████████| 7/7 [00:02<00:00, 2.48it/s]
  0%|          | 0/50 [00:00<?, ?it/s]
D:\ProgramData\envs\pytorch\Lib\site-packages\diffusers\models\attention_processor.py:1279: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
100%|██████████| 50/50 [00:05<00:00, 8.44it/s]

lamguy commented 1 month ago

+1 I am experiencing the exact same issue

gaoming714 commented 2 weeks ago

Warning: 1Torch was not compiled with flash attention.

First of all, some good news: this failure usually does not stop the program from running, it just makes it slower.

The warning appears because, since the torch 2.2 update, FlashAttention-2 is supposed to be used as the optimal attention backend, but it fails to start.

The PyTorch 2.2 release blog (https://pytorch.org/blog/pytorch2-2/) lists this as one of the major updates:

scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.

Normally, the backend selection order is FlashAttention > Memory-Efficient Attention (xformers) > the PyTorch C++ implementation (math).
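A minimal sketch of my own (not from the thread) showing how to inspect which SDPA backends PyTorch is allowed to use and how to pin scaled_dot_product_attention to a single backend, using the torch.backends.cuda toggles available in PyTorch 2.x; the tensor shapes are arbitrary example values:

import torch
import torch.nn.functional as F

# Which SDPA backends PyTorch is currently willing to try
print(torch.backends.cuda.flash_sdp_enabled())          # FlashAttention
print(torch.backends.cuda.mem_efficient_sdp_enabled())  # memory-efficient
print(torch.backends.cuda.math_sdp_enabled())           # C++ math fallback

# Arbitrary example tensors: (batch, heads, seq_len, head_dim), fp16 on CUDA
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the math backend; FlashAttention is never attempted
# inside this context, so the "not compiled with flash attention" warning
# is not triggered here.
with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_math=True,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)

(In PyTorch 2.3+ the same selection is done with the torch.nn.attention.sdpa_kernel context manager.)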

(I don't understand why it is designed this way, and the warning itself gives no hint about what is actually wrong. I hope the next official release improves it.)

The pitfalls I actually had to work around are the following:

  1. FlashAttention-2 is supported in PyTorch and is the first choice, so this warning is emitted whenever FlashAttention-2 fails to start. (Some people have benchmarked it and found that FlashAttention-2 does not bring much improvement.)

  2. FlashAttention-2 does not yet have a complete ecosystem. The official releases (https://github.com/Dao-AILab/flash-attention) only support Linux; Windows users have to compile it themselves (which is very slow even with ninja installed). Third-party prebuilt Windows wheels can be downloaded from https://github.com/bdashore3/flash-attention/releases.

  3. The hardware requirement is at least an RTX 30-series card: FlashAttention only supports Ampere GPUs or newer, so a 3060 can run it (the sketch after this list prints the compute capability).

  4. There is also a small chance that the CUDA version in your environment is incompatible with the CUDA version your torch wheel was built against; the official torch 2.x wheels are built for CUDA 12.1 (torch2.*+cu121). The sketch after this list prints both versions.
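For points 3 and 4, here is a small diagnostic sketch of my own (assuming a CUDA build of torch is installed) that prints the CUDA version the wheel was built with and the GPU's compute capability, so you can verify both the cu121 match and the Ampere requirement:

import torch

print("torch:", torch.__version__)             # e.g. 2.2.2+cu121
print("built with CUDA:", torch.version.cuda)  # CUDA version of the wheel

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("GPU:", torch.cuda.get_device_name(), f"(sm_{major}{minor})")
    # FlashAttention requires Ampere (sm_80) or newer, e.g. RTX 30-series
    if (major, minor) < (8, 0):
        print("GPU too old for FlashAttention; the math or memory-efficient "
              "backend will be used instead.")
else:
    print("CUDA is not available in this environment.")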