ehartford opened 9 months ago
AMD is working on it: https://github.com/ROCmSoftwarePlatform/flash-attention
I've not tested it yet, but a new branch with WMMA optimizations for Radeon 7000 was added just yesterday it seems.
I have composed this guide for my AMD AI configuration: https://github.com/nktice/AMD-AI The ROCm fork of flash-attention appears to work with ROCm 5.7.3. [ https://github.com/nktice/AMD-AI/blob/main/ROCm-5.7.md - I haven't tested it much, but the exllamav2 warnings that appear when flash-attention is missing disappear once it's installed in this case... ]
Alas, it does not work with ROCm 6 at the time of writing. [ https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md - in this case exllamav2 crashes if flash-attention ( same as above ) is installed. ]
The problem with this is that the AMD fork is always behind the main project and is hard to maintain in parallel with upstream development.
What would be helpful is for AMD's changes to be merged back upstream, so that they don't have to start from scratch every time the main flash-attention code is updated.
@tridao is it possible to merge this to support ROCm?
I think that's a fork maintained by AMD folks and it's not meant to be merged.
I doubt they would disapprove of merging; it seems like just a communication gap. I will reach out.
Since it's been a while and they haven't updated or integrated it, I'd like to mention that AMD rarely updates or maintains such things...
It's common for them to abandon such projects with little notice...
For example, their bitsandbytes port is well out of date - https://github.com/ROCm/bitsandbytes - leading others to improvise for themselves to get things working. [ Here's the most recent working bitsandbytes I've found for ROCm... it's also dated, but not quite as abandoned as AMD's own... ] https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6 There's been no quarrel about people's forked versions, and there are a few - but without AMD's help it is something of a mess of mixed offerings.
More likely, they offered an example of what could be done and how to do it, so that the 'community' could take it from there. [ If that weren't the case, they'd presumably say so clearly, or keep the code private. ]
I contacted exllamav2 about the version issue; here is what they said - AMD's offered version isn't of much use... https://github.com/turboderp/exllamav2/issues/397#issuecomment-2034652594
Maybe @howiejayz could be part of this conversation =)
Unfortunately I am no longer working on this project :( But as far as I know the other team is still working on it, and it will have long-term support.
I just submitted a PR to support AMD / ROCm on FlashAttention 2: https://github.com/Dao-AILab/flash-attention/pull/1010 This PR uses composable_kernel as the backend.
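For anyone sanity-checking a ROCm build of the kernel, its output can be compared against a plain reference implementation of scaled dot-product attention, which is the computation FlashAttention performs without materializing the full attention matrix. This is only an illustrative NumPy sketch (the function name and shapes are my own, not from the PR):

```python
import numpy as np

def sdpa_reference(q, k, v, causal=False):
    """Reference scaled dot-product attention: softmax(QK^T / sqrt(d)) V.

    q, k, v: arrays of shape (seqlen, head_dim). FlashAttention computes
    the same result tile-by-tile, never storing the full score matrix.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Mask out future positions (strict upper triangle).
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = sdpa_reference(q, k, v, causal=True)
print(out.shape)  # (4, 8)
```

With the causal mask, the first query position can only attend to the first key, so `out[0]` equals `v[0]` exactly - a handy spot check when comparing against kernel output.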
Looking at the compile targets, this patch only works on CDNA GPUs, I assume? Is RDNA3 support still in the cards?
@wsippel Yes, the new PR only works on MI200 and MI300 for now.
I have mi100s, would love to be able to use them
We found that MI100 may fail in some of the bf16 test cases, hence MI100 is not officially supported for now.
I would like to look into this bf16 issue. Is the cause well understood or in need of research?
We have been focusing on MI300 improvements recently, but MI100 is still in our backlog.
I have 24 MI100s and would very much like to see MI100 support added. Is there anything I can do to help?
I would like to concur with ehartford. I'm trying to get the AMD folks to provide more info on the cause of a page fault during the tests, which according to the ROCm folks is a flash-attention issue.
@iratebadger are you using the main branch here with MI200 or MI300? As far as I know, bf16 on MI100 is only 92.3 TFLOPS, while fp16 is better (184.6 TFLOPS).
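Beyond throughput, bf16 is also easier to trip up numerically than fp16: it keeps only 7 mantissa bits versus fp16's 10, so small score differences can be lost, which may be related to the failing MI100 bf16 tests. A minimal NumPy sketch of the precision gap (simulating bf16 by truncating float32 bits - real hardware rounds to nearest, so this is a simplification):

```python
import numpy as np

def to_bf16(x):
    # Simulate bfloat16 by zeroing the low 16 bits of the float32 bit
    # pattern, keeping the sign, 8 exponent bits, and 7 mantissa bits.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(1.0 + 1.0 / 256)  # 1.00390625 needs 8 mantissa bits
as_fp16 = np.float16(x)          # fp16 has 10 mantissa bits: exact
as_bf16 = to_bf16(x)             # bf16 has only 7: fraction is lost

print(float(as_fp16))  # 1.00390625
print(float(as_bf16))  # 1.0
```

The trade-off is that bf16 shares float32's 8-bit exponent range, so it overflows far later than fp16 - fp16 kernels often need extra care with score scaling instead.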
@ehartford You should ask your AMD sales rep to increase the priority of MI100 in our roadmap. But you could also try fp16 on MI100.
Thank you for this advice! 😁
I want to again request AMD support, since flash-attention is now much more popular and usable than it has been.