SpQR compression method

JianbangZ commented 1 year ago

How feasible would it be to implement SpQR in ggml? The paper is "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression".

gardner commented 1 year ago

The paper: https://arxiv.org/pdf/2306.03078.pdf

The code: https://github.com/Vahe1994/SpQR

PoignardAzur commented 9 months ago

Given this comment: https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1597142154, it seems unlikely SpQR is going to be implemented any time soon:

The main idea of the SpQR paper is to separate "outliers". This has been tried as part of k-quants development and has been shown to be less effective; see for instance https://github.com/ggerganov/llama.cpp/discussions/1595#discussioncomment-6018205 (in https://github.com/ggerganov/llama.cpp/discussions/1595).
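For readers unfamiliar with the idea, here is a toy sketch of what "separating outliers" can look like. It is not the SpQR algorithm and not ggml code: the function names, the magnitude threshold, and the plain min-max 4-bit quantizer are all assumptions made for illustration, and the simple threshold only stands in for the paper's own outlier selection, which is not reproduced here.

```python
def quantize_with_outliers(weights, bits=4, outlier_threshold=3.0):
    """Toy illustration only: keep large-magnitude weights in full
    precision (sparse), quantize everything else uniformly."""
    # Hypothetical criterion: |w| above the threshold counts as an outlier.
    outliers = {i: w for i, w in enumerate(weights) if abs(w) > outlier_threshold}
    dense = [w for i, w in enumerate(weights) if i not in outliers]

    levels = (1 << bits) - 1  # 15 steps for 4 bits
    lo, hi = (min(dense), max(dense)) if dense else (0.0, 0.0)
    scale = (hi - lo) / levels if hi > lo else 1.0

    # Outlier positions get a placeholder code of 0; it is ignored on decode.
    codes = [0 if i in outliers else round((w - lo) / scale)
             for i, w in enumerate(weights)]
    return codes, scale, lo, outliers


def dequantize(codes, scale, lo, outliers):
    """Rebuild the weights: outliers come back exactly, the rest approximately."""
    return [outliers.get(i, lo + q * scale) for i, q in enumerate(codes)]


# Made-up example values: one large weight among several small ones.
weights = [0.1, -0.2, 0.05, 8.0, -0.15]
codes, scale, lo, outliers = quantize_with_outliers(weights)
restored = dequantize(codes, scale, lo, outliers)
```

In this example the single large weight is stored exactly, so the quantization range only has to cover the small weights. Without the separation, the same 16 levels would have to span the whole range including 8.0, which makes the step size for the small weights much coarser.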

If we read the SpQR paper more carefully, we find that what they mean by "nearly lossless compression" is a quantized model whose perplexity is within 1% of the full model's. The Q4_K_M variant of k-quants already achieves that for ggml; see for instance PR https://github.com/ggerganov/llama.cpp/pull/1684
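The "within 1%" criterion above is just a relative perplexity gap. A minimal sketch of that check follows; the function name and the perplexity values in the usage lines are hypothetical and not taken from the paper or from the linked PR.

```python
def within_relative_gap(ppl_quantized, ppl_full, tolerance=0.01):
    """Return True if the quantized model's perplexity exceeds the
    full model's perplexity by at most `tolerance` (relative)."""
    relative_gap = (ppl_quantized - ppl_full) / ppl_full
    return relative_gap <= tolerance


# Made-up perplexity values, for illustration only.
close_enough = within_relative_gap(5.95, 5.91)  # gap of roughly 0.7%
too_far = within_relative_gap(6.10, 5.91)       # gap of roughly 3.2%
```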

We can probably close this issue.

ggerganov / ggml

SpQR compression method #240