evshiron / rocm_lab

DEPRECATED!
https://are-we-gfx1100-yet.github.io

Roadmap #2

Open evshiron opened 1 year ago

evshiron commented 1 year ago

BitsAndBytes

GPTQ for LLaMA

https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm

AutoGPTQ

https://github.com/are-we-gfx1100-yet/AutoGPTQ-rocm

Good performance: 43 it/s for 7B, 25 it/s for 13B, 15 it/s for 30B, and 0.25 it/s for 40B 3-bit, with 1 beam.
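As a point of reference for what low-bit quantization does to the weights, here is a minimal round-to-nearest 4-bit sketch in plain Python. This is only an illustration: the real GPTQ algorithm quantizes weights while compensating for per-layer output error, and every name and value below is hypothetical.

```python
# Minimal round-to-nearest 4-bit quantization sketch (hypothetical;
# NOT the actual GPTQ algorithm, which minimizes per-layer output error).

def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization into the 4-bit signed range [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    qs = [max(-8, min(7, round(w / scale))) for w in weights]
    return qs, scale

def dequantize_4bit(qs, scale):
    return [q * scale for q in qs]

weights = [0.12, -0.53, 0.70, -0.07]
qs, scale = quantize_4bit(weights)
recovered = dequantize_4bit(qs, scale)
# Each recovered value is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, recovered))
```

Storing 4-bit (or 3-bit) integers plus one scale per group is what lets 13B and 30B models fit in a single GPU's VRAM at all.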

Triton

Navi 3x support is currently work in progress. Stay tuned.

Achieves 13% of rocBLAS performance when running the 03-matrix-multiplication tutorial, using this branch, which was recently merged upstream.

There is still a lot of room for improvement.
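A number like "13% of rocBLAS" typically comes from converting kernel timings into effective throughput and taking the ratio. A small sketch of that arithmetic (the timings below are made up for illustration; only the FLOP-count formula is standard):

```python
def matmul_tflops(m, n, k, seconds):
    # An (m x k) @ (k x n) GEMM performs 2*m*n*k floating-point
    # operations (one multiply and one add per inner-product term).
    return 2 * m * n * k / seconds / 1e12

# Hypothetical timings for a 4096x4096x4096 GEMM:
rocblas = matmul_tflops(4096, 4096, 4096, 0.00137)  # ~100 TFLOPS
triton = matmul_tflops(4096, 4096, 4096, 0.01054)   # ~13 TFLOPS
ratio = triton / rocblas                            # ~0.13, i.e. 13% of rocBLAS
```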

AITemplate

Navi 3x support is currently work in progress. Stay tuned.

Reaches 25 it/s when generating a 512x512 image with Stable Diffusion, using this branch.

Somewhat disappointing. Is this really the limit of the RX 7900 XTX?
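For scale, sampler throughput in it/s converts to wall-clock time per image once you fix a step count (the step count below is an assumption for illustration, not from the benchmark):

```python
# Convert sampler throughput (iterations/s) into time per image.
def seconds_per_image(iters_per_sec, steps):
    return steps / iters_per_sec

# At 25 it/s, a hypothetical 50-step 512x512 generation takes 2 seconds.
assert seconds_per_image(25, 50) == 2.0
```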

Flash Attention

To be ported to Navi 3x.

ROCm

ROCm 5.6.0 is available now, but we can't find Windows support anywhere.

I think it might be more appropriate to call it ROCm 5.5.2.

DarkAlchy commented 1 year ago

A 3090 and a 7900 XTX run at about the same speed if the 3090 uses xformers and the XTX uses sub-quadratic attention in ComfyUI, generating Stable Diffusion XL 1.0 at 1024x1024 with 30 Euler steps. I expect a performance lift once we get Flash Attention, then SDP.
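For readers unfamiliar with the term, "sub-quad" refers to sub-quadratic-memory attention: instead of materializing the full score matrix, keys are processed in chunks while running max/sum statistics are carried along (the online-softmax trick, which is also the core idea behind Flash Attention). A pure-Python sketch of the idea, not ComfyUI's actual implementation:

```python
import math

def attention(q, k, v):
    """Full softmax attention: materializes all scores, O(n^2) memory."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        mx = max(scores)
        ws = [math.exp(s - mx) for s in scores]
        z = sum(ws)
        out.append([sum(w * vj[t] for w, vj in zip(ws, v)) / z
                    for t in range(len(v[0]))])
    return out

def chunked_attention(q, k, v, chunk=2):
    """Sub-quadratic-memory attention: process keys chunk by chunk,
    keeping only running max/sum statistics (online softmax)."""
    d = len(q[0])
    out = []
    for qi in q:
        mx, z = float("-inf"), 0.0
        acc = [0.0] * len(v[0])
        for start in range(0, len(k), chunk):
            ks, vs = k[start:start + chunk], v[start:start + chunk]
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                      for kj in ks]
            new_mx = max(mx, max(scores))
            # Rescale previous accumulators to the new running max.
            correction = math.exp(mx - new_mx)
            z *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, vs):
                w = math.exp(s - new_mx)
                z += w
                acc = [a + w * x for a, x in zip(acc, vj)]
            mx = new_mx
        out.append([a / z for a in acc])
    return out
```

Both functions produce the same output; the chunked version just never holds more than one chunk of scores at a time, which is why it can run large resolutions in less VRAM.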

evshiron commented 1 year ago

Yeah. There are attention implementations for Navi 3x in Composable Kernel, which is used in AITemplate, and which is said to reach 30 it/s for Stable Diffusion.

I previously made a dirty Flash Attention implementation and integrated it into PyTorch, but I didn't see a performance difference compared to the default math implementation, and the generated images were meaningless.

There are too many parameters in CK, and it's hard to correctly port the XDL code to WMMA, so I gave up.

DarkAlchy commented 1 year ago

Flash is coming, and supposedly that will enable PyTorch 2's SDP?

evshiron commented 1 year ago

https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

SDP is currently available for Navi 3x, but among the three underlying implementations of SDP (Flash Attention, Memory Efficient Attention, and the math implementation), Navi 3x can only use the last one, which just invokes PyTorch methods from C++ and offers no substantial optimization.
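For reference, all three backends compute the same attention function; the math implementation simply evaluates it directly with generic tensor operations, while the other two fuse it into optimized kernels:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```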

The current development of Flash Attention for ROCm is focused on CDNA, and I don't know when RDNA will truly be able to utilize Flash Attention. All I can say is that there is potential.

DarkAlchy commented 1 year ago

It's very sad that a card with all this potential on the hardware side is falling down on the software side.