RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

Implement quantization on-the-fly #100

Open · saharNooby opened 1 year ago

saharNooby commented 1 year ago

This feature allows quantizing FP32/FP16 models on-the-fly into any quantized format, without the need to explicitly run quantize.py and keep quantized models on disk.

The intended use case is keeping only the FP16 model on disk and not wasting disk space on quantized copies in every possible format.

Furthermore, if the quantization format changes again, those who use on-the-fly quantization will not even notice: the updated rwkv.cpp will simply use the new format when loading the FP16 model. A sketch of how the two workflows compare is shown below.
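For illustration only, here is a minimal C sketch contrasting the current offline workflow with the proposed one. It assumes the rwkv.h signatures of that period (rwkv_quantize_model_file taking a format name string such as "Q5_1"); rwkv_init_from_file_ex and its format argument are invented names for the hypothetical on-the-fly path, not the actual API of this PR.

```c
// Sketch only: existing calls shown as assumed from rwkv.h;
// the commented-out call is hypothetical.
#include <stdint.h>
#include "rwkv.h"

int main(void) {
    const uint32_t n_threads = 4;

    // Today: quantize offline (quantize.py wraps this call), which produces
    // a second copy of the model on disk, then load the quantized file.
    rwkv_quantize_model_file("model-fp16.bin", "model-q5_1.bin", "Q5_1");
    struct rwkv_context * ctx = rwkv_init_from_file("model-q5_1.bin", n_threads);

    // With on-the-fly quantization, only model-fp16.bin would need to exist;
    // tensors would be quantized to the requested format while loading.
    // Hypothetical call, for illustration only:
    //
    //     struct rwkv_context * ctx = rwkv_init_from_file_ex("model-fp16.bin", n_threads, "Q5_1");

    rwkv_free(ctx);
    return 0;
}
```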

saharNooby commented 1 year ago

@LoganDark Thanks for describing the roadmap! Let's wait for the API redesign then. I hope it won't be too breaking :)

I'll leave this PR hanging as a draft until the new loading method is available, so that users who want on-the-fly quantization now can notice and use this branch.

LoganDark commented 1 year ago

> I hope it won't be too breaking :)

It should be possible to reimplement the current API in terms of the new one, in order to keep compatibility with existing programs :)
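As a rough sketch of what such a compatibility shim could look like, the existing rwkv_init_from_file could become a thin wrapper over a redesigned loader. Here rwkv_load and rwkv_load_options are invented names used purely for illustration under that assumption; the real redesigned API may look entirely different.

```c
#include <stdint.h>

struct rwkv_context;

// Hypothetical redesigned loading entry point (names invented for
// illustration only).
struct rwkv_load_options {
    const char * model_file_path;
    uint32_t     n_threads;
    const char * target_format; // NULL = keep the format stored in the file
};

struct rwkv_context * rwkv_load(const struct rwkv_load_options * options);

// The existing entry point stays as a thin wrapper over the new one,
// so programs built against the old API keep working unchanged.
struct rwkv_context * rwkv_init_from_file(const char * model_file_path, const uint32_t n_threads) {
    struct rwkv_load_options options = {
        .model_file_path = model_file_path,
        .n_threads       = n_threads,
        .target_format   = NULL // the old API never quantized on load
    };

    return rwkv_load(&options);
}
```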