efeslab / Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

e2e demonstration for bigger models #24

Open jiwonsong-dev opened 1 week ago

jiwonsong-dev commented 1 week ago

Hi, thank you for the great work and effort.

The current kernels seem to support only the dimensions of 7B models, i.e., hidden dimension 4096. How can I extend them to larger models like Llama-30B or 65B? I get an error when I simply add template instances for the larger dimensions.
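For context, these are the hidden sizes I have in mind (values taken from the LLaMA configs, so please double-check them):

```cuda
// Hidden dimensions I would like to cover (from the LLaMA model configs;
// treat these values as assumptions to verify):
constexpr int HIDDEN_7B  = 4096;  // currently supported
constexpr int HIDDEN_13B = 5120;
constexpr int HIDDEN_30B = 6656;
constexpr int HIDDEN_65B = 8192;
```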

Thank you.

happierpig commented 1 week ago

Hi @jiwonsong-dev,

Thanks for your interest in this project!

For supporting different input shapes:

- The GEMM kernel can be used without any performance tuning by changing https://github.com/efeslab/Atom/blob/7e3618b1a7a7c86e1c93cc909b1510c046d76ac6/kernels/src/GEMM/bench_dense_layer_gemm_i4_o16.cu#L69.
- The attention kernels can be used as-is, since different models are naturally supported by FlashInfer.
- The REORDER and RMS_NORM kernels, however, are coupled to the shape by design: the blockDim is hard-coded (https://github.com/efeslab/Atom/blob/7e3618b1a7a7c86e1c93cc909b1510c046d76ac6/kernels/include/Reorder/Reorder.cuh#L217) and the last block is forced to quantize the INT8 outliers (https://github.com/efeslab/Atom/blob/7e3618b1a7a7c86e1c93cc909b1510c046d76ac6/kernels/include/Reorder/Reorder.cuh#L171). Some kernel work is needed to support larger shapes; a rough illustration of the coupling is sketched below.
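To make the coupling concrete, here is a minimal sketch of the pattern, assuming a per-row reorder kernel with a hard-coded blockDim. The kernel name (`reorder_rows`), `HIDDEN_DIM`, `THREADS`, and the per-thread workload are illustrative assumptions, not the actual Reorder.cuh code:

```cuda
// Minimal illustration only -- not the actual Atom Reorder.cuh kernel.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <numeric>
#include <vector>

constexpr int HIDDEN_DIM = 4096;                        // tuned for 7B models
constexpr int THREADS = 128;                            // hard-coded blockDim.x
constexpr int ELEMS_PER_THREAD = HIDDEN_DIM / THREADS;  // implicitly assumes 4096

// Each block reorders one row of activations according to a channel permutation.
__global__ void reorder_rows(const half* __restrict__ in,
                             half* __restrict__ out,
                             const int* __restrict__ perm) {
    const int row = blockIdx.x;
    #pragma unroll
    for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
        // Coalesced column index; covers all HIDDEN_DIM channels per row.
        const int col = i * THREADS + threadIdx.x;
        out[row * HIDDEN_DIM + col] = in[row * HIDDEN_DIM + perm[col]];
    }
    // In Atom's real kernel the last block additionally quantizes the INT8
    // outlier channels, so the outlier handling is also tied to the shape.
}

int main() {
    const int rows = 16;
    half *d_in, *d_out;
    int *d_perm;
    cudaMalloc(&d_in, rows * HIDDEN_DIM * sizeof(half));
    cudaMalloc(&d_out, rows * HIDDEN_DIM * sizeof(half));
    cudaMalloc(&d_perm, HIDDEN_DIM * sizeof(int));
    cudaMemset(d_in, 0, rows * HIDDEN_DIM * sizeof(half));

    // Identity permutation just to make the launch well-defined.
    std::vector<int> perm(HIDDEN_DIM);
    std::iota(perm.begin(), perm.end(), 0);
    cudaMemcpy(d_perm, perm.data(), HIDDEN_DIM * sizeof(int), cudaMemcpyHostToDevice);

    reorder_rows<<<rows, THREADS>>>(d_in, d_out, d_perm);
    cudaDeviceSynchronize();
    printf("elements per thread at hidden dim %d: %d\n", HIDDEN_DIM, ELEMS_PER_THREAD);
    return 0;
}
```

For hidden dimensions like 6656 (30B) or 8192 (65B), both the per-thread workload and the outlier-block logic would have to be recomputed, not just re-instantiated with a different template argument.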