Closed: irasin closed this issue 1 year ago.
How to quantize: follow the steps in README.md.
"Do you use GPTQ/RPTQ": no; maybe they are experimenting with it in upstream ggml
, but currently tensors are just split into fixed-size blocks of size 32 and then quantized block-wise.
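To make the block-wise scheme concrete, here is a minimal sketch in Python of what "split into blocks of 32 and quantize block-wise" looks like for an 8-bit format. This is an illustration of the general idea (one scale per 32-value block, symmetric round-to-nearest), not the exact ggml layout; the function names and the Q8_0-like choice of a single absmax scale per block are my assumptions.

```python
import numpy as np

BLOCK_SIZE = 32  # fixed block size, as in ggml's block-wise formats

def quantize_blockwise_int8(x: np.ndarray):
    """Quantize a 1-D float array to int8, one scale per 32-value block."""
    assert x.size % BLOCK_SIZE == 0
    blocks = x.reshape(-1, BLOCK_SIZE)
    # One scale per block: map the block's max absolute value to 127.
    scales = np.abs(blocks).max(axis=1) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales[:, None]), -127, 127).astype(np.int8)
    return q, scales

def dequantize_blockwise_int8(q: np.ndarray, scales: np.ndarray):
    """Reconstruct floats from quantized blocks and per-block scales."""
    return (q.astype(np.float32) * scales[:, None]).reshape(-1)

np.random.seed(0)
x = np.random.randn(4 * BLOCK_SIZE).astype(np.float32)
q, scales = quantize_blockwise_int8(x)
x_hat = dequantize_blockwise_int8(q, scales)
max_err = np.abs(x - x_hat).max()
```

Lower-bit formats (INT4, INT5) follow the same pattern with fewer quantization levels per block and a correspondingly larger reconstruction error.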
"Do you use int8 @ int8 -> int32 cublas": don't know... You may check out ggml
CUDA code.
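For reference, the pattern the question is asking about is an integer GEMM where int8 inputs are multiplied and accumulated into int32 so the products cannot overflow (on CUDA this corresponds to cublasGemmEx with 8-bit integer inputs and a 32-bit integer result). Whether ggml's CUDA path actually uses it would need to be confirmed in the source; the numpy sketch below just demonstrates the accumulation pattern itself.

```python
import numpy as np

np.random.seed(0)
# Two int8 operands in the representable symmetric range [-127, 127].
a = np.random.randint(-127, 128, size=(4, 8), dtype=np.int8)
b = np.random.randint(-127, 128, size=(8, 3), dtype=np.int8)

# Widen to int32 before multiplying: products of two int8 values can reach
# 127 * 127 = 16129, far outside the int8 range, so accumulation must be
# done in int32 (the "int8 @ int8 -> int32" pattern).
c = a.astype(np.int32) @ b.astype(np.int32)
```

A real quantized matmul would then rescale `c` by the product of the two operands' per-block scales to recover floating-point values.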
How do you generate the quantized INT4, INT5, and INT8 models?
Do you use GPTQ/RPTQ, or normal per-tensor/per-channel PTQ? For the quantized INT8 model, do you use int8 @ int8 -> int32 cuBLAS?