efeslab / Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Question regarding the efficiency evaluation #17

Closed: FlyFoxPlayer closed this issue 6 months ago

FlyFoxPlayer commented 6 months ago

Hello, regarding the efficiency evaluation experiments, the repository seems to include code only for measuring the throughput and latency of Atom and SmoothQuant. How were the throughput and latency results for FP16 and AWQ obtained?

happierpig commented 6 months ago

Hi @FlyFoxPlayer,

Thanks for your interest and issue.

I'm sorry for leaving out the evaluation scripts for W4A16 (AWQ) and FP16. We have added the experiment setup as well as the reproduction scripts in this commit.

FlyFoxPlayer commented 5 months ago

Hello @happierpig, is the FP16.cu file also missing from the kernels/baselines/src directory? I would like to know how to evaluate the FP16 baseline.

happierpig commented 5 months ago

Hi @FlyFoxPlayer,

Basically, we use PyTorch to evaluate the performance of the FP16 baselines. For the end-to-end (e2e) results, nn.Linear serves as the FP16 GEMM. For the kernel results, we provide scripts here.
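The reply only states that the FP16 baseline runs through PyTorch, with nn.Linear acting as the GEMM. A minimal sketch of how such a baseline could be timed follows. It is not the script from the Atom repository. The shapes (m tokens, k input features, n output features), the bias-free layer, the iteration counts, and the helper name benchmark_fp16_linear are all hypothetical. On a machine without CUDA it falls back to float32, because half-precision GEMM support on CPU varies across PyTorch versions.

```python
import time

import torch
import torch.nn as nn


def benchmark_fp16_linear(m=16, k=4096, n=4096, warmup=10, iters=100):
    """Time nn.Linear forward passes as a stand-in for an FP16 GEMM baseline.

    Hypothetical helper: m = tokens in the batch, k = input features,
    n = output features. Falls back to float32 when CUDA is unavailable.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # A single bias-free linear layer computes x @ W^T, i.e. one GEMM.
    layer = nn.Linear(k, n, bias=False).to(device=device, dtype=dtype)
    x = torch.randn(m, k, device=device, dtype=dtype)

    with torch.inference_mode():
        # Warm-up runs absorb one-time costs such as library initialization.
        for _ in range(warmup):
            layer(x)
        if device == "cuda":
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
        if device == "cuda":
            # CUDA kernels launch asynchronously; wait before stopping the clock.
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    latency_s = elapsed / iters
    # Each of the m * k * n multiply-adds counts as 2 floating-point operations.
    tflops = 2 * m * k * n / latency_s / 1e12
    return {
        "device": device,
        "dtype": str(dtype),
        "latency_ms": latency_s * 1e3,
        "tflops": tflops,
    }


if __name__ == "__main__":
    print(benchmark_fp16_linear())
```

Wall-clock timing around explicit synchronization keeps the sketch short; for finer-grained kernel measurements, CUDA events or a profiler are common alternatives.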