ROCm / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Enable both Qloop and Kloop #5

Closed guangzlu closed 1 year ago

guangzlu commented 1 year ago

In this PR, both Qloop and Kloop are enabled. You can choose the branch with the environment variable USE_QLOOP: if USE_QLOOP=1, qloop is used; if USE_QLOOP=0, kloop is used. In setup.py, USE_QLOOP=1 is the default (a sketch of how such a build-time switch might be wired up is shown after the table notes below). Here is a table of the performance comparison between Qloop and Kloop.

kloop.vs.qloop.xlsx

(In this table, RTZ (round-toward-zero) rounding is used, and we chose the function `flash_attn_unpadded_func` for the test.) From the table, we can see that when comparing total performance (fwd + bwd), qloop is better in most cases, but when comparing fwd only, kloop is better. So we recommend using kloop for inference and qloop for training.
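For context, below is a minimal sketch of how a build-time switch like USE_QLOOP could be wired into setup.py. The macro name, source list, and compile flags here are illustrative assumptions, not necessarily what this PR actually uses.

```python
# Illustrative sketch only: selecting the qloop/kloop variant at build time
# via the USE_QLOOP environment variable. Macro and source names are assumed.
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

use_qloop = os.environ.get("USE_QLOOP", "1") == "1"  # qloop by default

ext = CUDAExtension(
    name="flash_attn_cuda",
    sources=["csrc/flash_attn/fmha_api.cpp"],  # placeholder source list
    extra_compile_args={
        "cxx": ["-O3"],
        # Hypothetical preprocessor flag that the kernels would branch on.
        "nvcc": ["-O3", f"-DUSE_QLOOP={1 if use_qloop else 0}"],
    },
)

setup(name="flash-attn", ext_modules=[ext], cmdclass={"build_ext": BuildExtension})
```

With a pattern like this, `USE_QLOOP=0 pip install .` would build the kloop variant.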
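And here is a rough timing sketch of the kind of fwd + bwd measurement behind the table, assuming the v1-style `flash_attn_unpadded_func` interface (packed q/k/v of shape `(total_tokens, nheads, head_dim)` plus `cu_seqlens` offsets); shapes, dropout setting, and the import path may need adjusting for this branch.

```python
# Rough fwd+bwd timing sketch for flash_attn_unpadded_func.
# Assumes the v1-style interface; adjust to match the installed build.
import torch
from flash_attn.flash_attn_interface import flash_attn_unpadded_func

batch, seqlen, nheads, head_dim = 4, 1024, 16, 64
total = batch * seqlen
device, dtype = "cuda", torch.float16

q, k, v = (torch.randn(total, nheads, head_dim, device=device, dtype=dtype,
                       requires_grad=True) for _ in range(3))
# Cumulative sequence-length offsets: [0, seqlen, 2*seqlen, ...]
cu_seqlens = torch.arange(0, (batch + 1) * seqlen, seqlen,
                          device=device, dtype=torch.int32)

def fwd_bwd(dropout_p):
    out = flash_attn_unpadded_func(q, k, v, cu_seqlens, cu_seqlens,
                                   seqlen, seqlen, dropout_p)
    out.sum().backward()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
fwd_bwd(0.1)  # warm-up
torch.cuda.synchronize()
start.record()
for _ in range(10):
    fwd_bwd(0.1)
end.record()
torch.cuda.synchronize()
print(f"fwd+bwd avg: {start.elapsed_time(end) / 10:.2f} ms")
```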

fsx950223 commented 1 year ago

When the API is used for inference, dropout is always 0, but there is no significant difference between qloop and kloop forward performance with dropout=0.

guangzlu commented 1 year ago

> When the API is used for inference, dropout is always 0, but there is no significant difference between qloop and kloop forward performance with dropout=0.

Yes, that's because in the forward pass, when dropout is 0, there is a switch that lets us skip the dropout function entirely.
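For intuition, here is a purely conceptual Python sketch of that switch; the real check lives in the C++/HIP kernels, not in Python, and the names below are illustrative.

```python
# Conceptual illustration only: when dropout_p == 0 the dropout path
# (RNG setup, masking, rescaling) is skipped entirely, which is why the
# qloop/kloop forward difference mostly disappears at inference time.
import torch
import torch.nn.functional as F

def attention_forward(scores, v, dropout_p=0.0):
    probs = torch.softmax(scores, dim=-1)
    if dropout_p > 0.0:
        # Only taken during training with dropout enabled; this is where
        # the forward-time difference seen in the table shows up.
        probs = F.dropout(probs, p=dropout_p)
    return probs @ v
```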

guangzlu commented 1 year ago

Please move to https://github.com/ROCmSoftwarePlatform/flash-attention/pull/6