RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

question about the quantization #81

Closed · irasin closed this issue 1 year ago

irasin commented 1 year ago

How do you generate the quantized INT4, INT5, and INT8 models?

Do you use GPTQ/RPTQ, or normal per-tensor/per-channel PTQ? And for the quantized int8 model, do you use int8 @ int8 -> int32 cuBLAS?

saharNooby commented 1 year ago
  1. How to quantize: follow README.md (see the command sketch after this list).

  2. "Do you use GPTQ/RPTQ": no; maybe they are experimenting with it in upstream ggml, but currently tensors are just split into fixed-size blocks of size 32 and then quantized block-wise.

  3. "Do you use int8 @ int8 -> int32 cublas": don't know... You may check out ggml CUDA code.