Boom-Hacker opened 1 year ago
Yes, that branch is very old. I made ad hoc fixes while debugging and only managed to bring it to a point where it reaches about 25 it/s. According to reports, builds using this commit of ROCm LLVM can reach 30 it/s.
The submodule in this branch is linked to the specified branch of Composable Kernel, which has a Fused Attention implementation for Navi 3x.
I spent a lot of time trying to integrate this Fused Attention into PyTorch; you can find my efforts here:
If you're interested, you can check out the repos in the org for further research.
I only ran AITemplate on navi3_rel_ver_1.0; it is very old.