jiwonsong-dev opened 2 months ago
Hi, thank you for the great work and effort.

The current kernels seem to support only the dimensions of 7B models, i.e. hidden dimension 4096. How can I extend them to larger models like Llama-30B or 65B? It returns an error when I just add template instances for a larger dimension.

Thank you.
Hi @jiwonsong-dev,

Thanks for your interest in this project!

To support different input shapes, the GEMM kernel can be used without any performance tuning; you only need to change the hard-coded problem size at https://github.com/efeslab/Atom/blob/7e3618b1a7a7c86e1c93cc909b1510c046d76ac6/kernels/src/GEMM/bench_dense_layer_gemm_i4_o16.cu#L69. The attention kernels also work for different models out of the box, since FlashInfer naturally supports them.
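For illustration, here is a minimal sketch of the kind of change meant above. The `GemmShape` struct and the `bench_dense_gemm_i4_o16` entry point are hypothetical stand-ins (the real constants live in the benchmark file linked above); only the hidden dimensions are actual Llama model values:

```cuda
// Hypothetical sketch: only the hard-coded GEMM problem size changes.
// Hidden dims: Llama-7B = 4096, 13B = 5120, 30B = 6656, 65B = 8192.
#include <cstdio>

struct GemmShape { int m, n, k; };

int main() {
    const GemmShape shapes[] = {
        {16, 4096, 4096},   // Llama-7B (the shape the benchmark ships with)
        {16, 6656, 6656},   // Llama-30B
        {16, 8192, 8192},   // Llama-65B
    };
    for (const GemmShape& s : shapes) {
        // bench_dense_gemm_i4_o16(s.m, s.n, s.k);  // hypothetical entry point
        printf("benchmark GEMM with M=%d N=%d K=%d\n", s.m, s.n, s.k);
    }
    return 0;
}
```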
However, the REORDER and RMS_NORM kernels' design is coupled to the shape: the blockDim is hard-coded (https://github.com/efeslab/Atom/blob/7e3618b1a7a7c86e1c93cc909b1510c046d76ac6/kernels/include/Reorder/Reorder.cuh#L217) and the last block is forced to quantize the INT8 outliers (https://github.com/efeslab/Atom/blob/7e3618b1a7a7c86e1c93cc909b1510c046d76ac6/kernels/include/Reorder/Reorder.cuh#L171). Some kernel engineering is needed to support larger shapes.
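To make the shape coupling concrete, below is a hedged sketch (not Atom's actual code) of how the Reorder launch could derive blockDim and the outlier block from the runtime hidden size instead of hard-coding them. `reorder_quantize` and `kGroupSize` are hypothetical names, and the kernel body is elided:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdint>

constexpr int kGroupSize = 128;  // channels handled per block (assumed)

__global__ void reorder_quantize(const half* in, int8_t* out, int hidden) {
    // Instead of hard-coding the last block (e.g. blockIdx.x == 31 for
    // hidden = 4096), derive the INT8 outlier block from the shape.
    const int numGroups = hidden / kGroupSize;
    const bool isOutlierBlock = (blockIdx.x == numGroups - 1);
    (void)isOutlierBlock;  // ... INT4 path for normal groups, INT8 for outliers ...
    (void)in; (void)out;
}

int main() {
    const int hidden = 6656;               // Llama-30B; 8192 for 65B
    dim3 grid(hidden / kGroupSize);        // 52 blocks instead of a fixed 32
    dim3 block(kGroupSize);                // previously a hard-coded blockDim
    reorder_quantize<<<grid, block>>>(nullptr, nullptr, hidden);
    cudaDeviceSynchronize();
    return 0;
}
```

Generalizing RMS_NORM would presumably follow the same pattern: derive the launch dimensions from the hidden size rather than from fixed constants.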