Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

add support for AMD / ROCm / HIP #707

Open ehartford opened 9 months ago

ehartford commented 9 months ago

I want to again request AMD support, since it is now much more popular and usable than it has been

wsippel commented 9 months ago

AMD is working on it: https://github.com/ROCmSoftwarePlatform/flash-attention

I've not tested it yet, but it seems a new branch with WMMA optimizations for the Radeon 7000 series was added just yesterday.

nktice commented 9 months ago

I have composed this guide for my AMD AI configuration: https://github.com/nktice/AMD-AI The ROCm project's flash-attention fork appears to work with ROCm 5.7. [ https://github.com/nktice/AMD-AI/blob/main/ROCm-5.7.md - I've not tested it much, but the exllamav2 warnings that appear when flash-attention is not in use disappear once it's installed in this case. ]

Alas, it does not work with ROCm 6 at the time of writing. [ https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md - in this case exllamav2 crashes if flash-attention ( same as above ) is installed. ]
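For anyone checking their own setup, a probe along these lines (a rough sketch, not exllamav2's actual code) is roughly what such loaders do to decide whether flash-attn can be used:

```python
import importlib.util

def flash_attn_available() -> bool:
    """Best-effort probe: is a working flash-attn build importable?"""
    if importlib.util.find_spec("flash_attn") is None:
        return False
    try:
        from flash_attn import flash_attn_func  # noqa: F401
        return True
    except (ImportError, OSError):
        # e.g. the wheel was built against a different ROCm / PyTorch ABI
        return False

print("flash-attn usable:", flash_attn_available())
```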

An issue with this is that the AMD fork always lags behind upstream and is harder to maintain than the main repository.

What would be helpful is for AMD's changes to be merged back upstream, so that they do not have to start from scratch every time the main flash-attention code is updated.

ehartford commented 8 months ago

@tridao is it possible to merge this to support ROCm?

https://github.com/ROCmSoftwarePlatform/flash-attention

tridao commented 8 months ago

> https://github.com/ROCmSoftwarePlatform/flash-attention

I think that's a fork maintained by AMD folks and it's not meant to be merged.

ehartford commented 8 months ago

I doubt they would disapprove of merging; it seems to just be a gap in communication. I will reach out.

nktice commented 5 months ago

> https://github.com/ROCmSoftwarePlatform/flash-attention
>
> I think that's a fork maintained by AMD folks and it's not meant to be merged.

As it's been a while and they haven't updated or integrated it, I'd like to mention that AMD rarely updates or maintains such things, and it's common for them to abandon such projects with little notice.

For example, their bitsandbytes port is well out of date - https://github.com/ROCm/bitsandbytes - leading others to improvise for themselves to get things working. [ Here's the most recent working bitsandbytes I've found for ROCm... it's also dated, but not quite as abandoned as AMD's own: https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6 ] There's been no quarrel about people's forked versions, and there are a few, but without AMD's help it is something of a mess of mixed offerings.

It is more likely they offered an example of what could be done - and how to do it - so that the community could take it from there. [ If that were not the case, they would presumably say so, or keep it private. ]

I have contacted the exllamav2 developers about the version issue; here is what they said - in short, AMD's offered version isn't of much use: https://github.com/turboderp/exllamav2/issues/397#issuecomment-2034652594

RichardFevrier commented 5 months ago

Maybe @howiejayz could be part of this conversation =)

jayz0123 commented 5 months ago

> Maybe @howiejayz could be part of this conversation =)

Unfortunately I am no longer working on this project :( But as far as I know the other team is still working on it, and it will have long-term support.

rocking5566 commented 3 months ago

I just submitted a PR to add AMD / ROCm support to FlashAttention 2: https://github.com/Dao-AILab/flash-attention/pull/1010 This PR uses composable_kernel as the backend.
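For context, the public Python API is expected to stay the same as on the CUDA path; a minimal usage sketch (arbitrary shapes, assuming a ROCm build of PyTorch and the PR installed):

```python
import torch
from flash_attn import flash_attn_func

# Sketch only: shapes are (batch, seqlen, nheads, headdim); inputs must be
# fp16 or bf16 tensors on the GPU, exactly as on the CUDA path.
batch, seqlen, nheads, headdim = 2, 2048, 16, 128
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 2048, 16, 128])
```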

wsippel commented 3 months ago

Looking at the compile targets, I assume this patch only works on CDNA GPUs? Is RDNA3 support still in the cards?

rocking5566 commented 3 months ago

@wsippel Yes, the new PR only works on MI200 and MI300 for now.
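A hedged sketch of how one might gate on the supported architectures (MI200 = gfx90a, MI300 = gfx942; MI100 = gfx908 is excluded for now); the `gcnArchName` property is only present on ROCm builds of PyTorch:

```python
import torch

# Architectures covered by the new ROCm backend for now:
# MI200 = gfx90a, MI300 = gfx942 (MI100 = gfx908 is not in the list).
SUPPORTED_ARCHS = ("gfx90a", "gfx942")

def rocm_fa_supported() -> bool:
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    # e.g. "gfx90a:sramecc+:xnack-" on ROCm builds; absent on CUDA builds
    arch = getattr(props, "gcnArchName", "")
    return any(arch.startswith(a) for a in SUPPORTED_ARCHS)

print("ROCm FlashAttention backend usable here:", rocm_fa_supported())
```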

ehartford commented 3 months ago

I have MI100s and would love to be able to use them.

rocking5566 commented 3 months ago

> I have MI100s and would love to be able to use them.

We found that MI100 may fail some of the bf16 test cases. Hence, MI100 is not officially supported for now.
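For reference, the failing checks are of the kind sketched below: comparing the fused bf16 kernel against a plain PyTorch reference within loose bf16 tolerances (an illustration, not the actual test from the repository's test suite):

```python
import math
import torch
from flash_attn import flash_attn_func

# Illustrative bf16 correctness check: fused kernel vs. a plain PyTorch
# reference computed in fp32, compared within loose bf16 tolerances.
batch, seqlen, nheads, headdim = 1, 512, 8, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.bfloat16) for _ in range(3))

out = flash_attn_func(q, k, v, causal=False)

qf, kf, vf = (t.transpose(1, 2).float() for t in (q, k, v))  # (batch, nheads, seqlen, headdim)
scores = qf @ kf.transpose(-2, -1) / math.sqrt(headdim)
ref = (scores.softmax(dim=-1) @ vf).transpose(1, 2).to(torch.bfloat16)

torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
```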

iratebadger commented 2 weeks ago

I would like to look into this bf16 issue. Is the cause well understood or in need of research?

rocking5566 commented 1 week ago

> I would like to look into this bf16 issue. Is the cause well understood or in need of research?

We have been focusing on MI300 improvements recently, but MI100 is still in our backlog.

ehartford commented 1 week ago

I have 24 MI100s and would very much like to see support added for them. Is there anything I can do to help?

iratebadger commented 1 week ago

I would like to concur with ehartford. I'm trying to get the AMD folks to provide more info on the cause of a page fault during the tests, which according to the ROCm folks is an FA issue.

rocking5566 commented 1 week ago

> I would like to concur with ehartford. I'm trying to get the AMD folks to provide more info on the cause of a page fault during the tests, which according to the ROCm folks is an FA issue.

@iratebadger Are you using the main branch here with MI200 or MI300? As far as I know, bf16 on MI100 is only 92.3 TFLOPs; fp16 is better (184.6 TFLOPs).

rocking5566 commented 1 week ago

> I have 24 MI100s and would very much like to see support added for them. Is there anything I can do to help?

@ehartford You should ask your AMD sales contact to raise the priority of MI100 on our roadmap. In the meantime, you could also try fp16 on MI100.
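A minimal sketch of that suggestion: cast the inputs to fp16 instead of bf16 before calling the kernel (shapes and values are arbitrary):

```python
import torch
from flash_attn import flash_attn_func

# Run the kernel in fp16 rather than bf16 (bf16 is what fails on MI100,
# and MI100's bf16 throughput is half its fp16 throughput anyway).
batch, seqlen, nheads, headdim = 2, 1024, 16, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

out = flash_attn_func(q, k, v, causal=True)
```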

ehartford commented 1 week ago

> I have 24 MI100s and would very much like to see support added for them. Is there anything I can do to help?
>
> @ehartford You should ask your AMD sales contact to raise the priority of MI100 on our roadmap. In the meantime, you could also try fp16 on MI100.

Thank you for this advice! 😁