Vahe1994 / SpQR


SqueezeLLM #14

Open · Iambestfeed opened this issue 1 year ago

Iambestfeed commented 1 year ago

Hey there, I was just wondering how this compares to SqueezeLLM; the perplexity/size trade-off seems on par.

Here's their paper and repo: https://arxiv.org/abs/2306.03078 https://github.com/SqueezeAILab/SqueezeLLM

Thank you!

Vahe1994 commented 1 year ago

Hello! It seems that we both take advantage of outliers in a fairly similar way, but the underlying quantization scheme is different. They do nonlinear (k-means) quantization that approximately minimizes a model-level loss, with no error compensation; we minimize a layer-level loss with error compensation. The SqueezeLLM approach appears to be more expensive in compute/memory since it requires backpropagation through the full model, but both algorithms should be able to compress SOTA LLMs fast enough.
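
To make the distinction concrete, here is a rough toy sketch (not our actual code, nor theirs; the shapes, bit-width, and the heavily simplified Hessian handling are illustrative assumptions only):

```python
# Toy sketch, not the SpQR or SqueezeLLM implementation: contrasts layer-wise
# quantization with error compensation against k-means (nonlinear) codebook
# quantization on a random weight matrix.
import torch

def uniform_quantize(w, bits=3):
    # Round-to-nearest uniform quantization of a 1-D tensor.
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2 ** bits - 1)
    return torch.round((w - lo) / scale) * scale + lo

def layerwise_quant_with_compensation(W, X, bits=3):
    # Layer-level objective: approximately minimize ||W X - W_hat X||, column
    # by column, folding each column's quantization error back into the
    # columns not yet quantized (very simplified OBQ/GPTQ-style update).
    W = W.clone()
    H = X @ X.T                              # proxy for the layer Hessian
    hinv_diag = 1.0 / (torch.diag(H) + 1e-6)
    for j in range(W.shape[1]):
        q_col = uniform_quantize(W[:, j], bits)
        err = (W[:, j] - q_col) * hinv_diag[j]
        if j + 1 < W.shape[1]:
            # distribute this column's error onto the remaining columns
            W[:, j + 1:] -= err[:, None] * H[j, j + 1:][None, :]
        W[:, j] = q_col
    return W

def kmeans_quantize(W, bits=3, iters=20):
    # Nonlinear (k-means) quantization: learn 2**bits shared centroids and
    # snap every weight to its nearest one; no error compensation step.
    flat = W.flatten()
    k = 2 ** bits
    centroids = torch.quantile(flat, torch.linspace(0, 1, k))
    for _ in range(iters):
        assign = (flat[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = flat[assign == c].mean()
    return centroids[assign].reshape(W.shape)

torch.manual_seed(0)
W = torch.randn(64, 64)
X = torch.randn(64, 256)                     # toy calibration activations
ref = W @ X
for name, W_hat in [
    ("layer-wise + compensation", layerwise_quant_with_compensation(W, X)),
    ("k-means, no compensation", kmeans_quantize(W)),
]:
    rel_err = ((ref - W_hat @ X).norm() / ref.norm()).item()
    print(f"{name}: relative layer output error = {rel_err:.4f}")
```

The first routine targets the layer output error and compensates each column's rounding error; the second fits a shared set of centroids to the weights with no compensation step.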

We experimented with nonlinear quantization earlier, but it gave limited or no gains once you take the memory overhead into account. However, we ran our experiments in a different scenario: layer-wise quantization for 65B models (SqueezeLLM uses a global loss for 7-30B models). It is plausible that their chosen configuration is more favorable for nonlinear (k-means) quantization. It's difficult to say more without actually combining the features of both approaches and testing them thoroughly.
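
To give a feel for where the codebook overhead comes from (the per-group storage layouts below are illustrative assumptions, not the exact formats used by either method):

```python
# Rough back-of-envelope under assumed storage layouts: average bits per
# weight when each group of weights carries its own fp16 codebook of
# 2**bits centroids (nonlinear), versus a single fp16 scale + zero-point
# pair per group (uniform).
def avg_bits_per_weight(bits, group_size, codebook):
    overhead_bits = (2 ** bits) * 16 if codebook else 2 * 16
    return bits + overhead_bits / group_size

for group_size in (64, 128, 1024):
    cb = avg_bits_per_weight(3, group_size, codebook=True)
    un = avg_bits_per_weight(3, group_size, codebook=False)
    print(f"group={group_size:4d}  codebook={cb:.2f} b/w  uniform={un:.2f} b/w")
```

With small groups, the per-group codebook quickly dominates the weight bits themselves; with very large (or per-channel) groups the overhead shrinks, which is why the favorable regime depends on the configuration.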

Iambestfeed commented 1 year ago

@Vahe1994 Thanks for the response. Honestly, my main concern is finding an effective way to compress and store the model. It seems that neither SpQR nor SqueezeLLM has delivered significant progress on that front yet. I would greatly appreciate it if you could expedite releasing the quantization code. Thank you.