RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

question about the quantization #81

Closed · irasin closed this issue 1 year ago

irasin commented 1 year ago

How do you generate the quantized INT4, INT5, and INT8 models?

Do you use GPTQ/RPTQ, or normal per-tensor/per-channel PTQ? And for the quantized int8 model, do you use int8 @ int8 -> int32 cuBLAS?

saharNooby commented 1 year ago
  1. How to quantize: follow README.md (see the command sketch after this list).

  2. "Do you use GPTQ/RPTQ": no; maybe they are experimenting with it in upstream ggml, but currently tensors are just split into fixed-size blocks of size 32 and then quantized block-wise.

  3. "Do you use int8 @ int8 -> int32 cublas": don't know... You may check out ggml CUDA code.