efeslab / Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Question about end-to-end efficiency evaluation of Atom #22

Closed cokeshao closed 1 month ago

cokeshao commented 1 month ago

Thanks for your great work! I have a small question here.

Why is the matrix dimension (bs, (hidden_dim - group_size) // 2) and not (bs, hidden_dim - group_size) here? What does the "// 2" mean? Is it some kind of hardware acceleration method? Could you elaborate on it? Thank you. https://github.com/efeslab/Atom/blob/7e3618b1a7a7c86e1c93cc909b1510c046d76ac6/kernels/baselines/python-api.ipynb#L285-L292

happierpig commented 1 month ago

Hi @cokeshao ,

Thanks for your interest. Here we use the PyTorch API to allocate the correct amount of space for the INT4 tensor. Since there was no direct API for INT4 at that time, the uint8 dtype was used to allocate CUDA memory. Two INT4 values are packed into each uint8 byte, so (hidden_dim - group_size) INT4 elements occupy (hidden_dim - group_size) // 2 uint8 elements.
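For illustration, here is a minimal PyTorch sketch of this packing idea. The shapes (bs, hidden_dim, group_size) are hypothetical and the packing code is not the exact kernel from the notebook; it only shows why two INT4 values share one uint8 element.

```python
import torch

# Hypothetical shapes for illustration only.
bs, hidden_dim, group_size = 8, 4096, 128

# Number of INT4 elements per row in the low-bit region.
n_int4 = hidden_dim - group_size

# Allocate storage: two INT4 values fit in one uint8 byte,
# hence the second dimension is n_int4 // 2.
packed = torch.empty(bs, n_int4 // 2, dtype=torch.uint8)

# Example packing: value `lo` goes into the low nibble, `hi` into the high nibble.
lo = torch.randint(0, 16, (bs, n_int4 // 2), dtype=torch.uint8)
hi = torch.randint(0, 16, (bs, n_int4 // 2), dtype=torch.uint8)
packed = (hi << 4) | lo  # each uint8 now holds two INT4 values
```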

cokeshao commented 1 month ago

Thank you for getting back to me so quickly. I was overcomplicating the issue.