ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

[Feature request] Implement CPU dynamic quantization #104

Open pablogranolabar opened 1 year ago

pablogranolabar commented 1 year ago

e.g. https://github.com/MiscellaneousStuff/openai-whisper-cpu
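
For context, a minimal sketch of what the linked repo does: post-training dynamic quantization of the PyTorch Whisper model's Linear layers. This is an illustration, not the linked repo's exact code; the model size ("base") and audio path are placeholders, and it assumes the openai-whisper package is installed.

```python
# Sketch: PyTorch post-training dynamic quantization applied to Whisper,
# along the lines of the linked repo (not its exact code).
import torch
import whisper

model = whisper.load_model("base")  # placeholder model size

# Weights of nn.Linear layers are stored as int8; activations are
# quantized on the fly at inference time on the CPU.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

result = quantized_model.transcribe("audio.wav")  # placeholder audio path
print(result["text"])
```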

jafri commented 1 year ago

@pablogranolabar would this also cut the model size? By a factor of 4?

pablogranolabar commented 1 year ago

Yah that's the hope. Digging into it today to do some memory profiling.

ggerganov commented 1 year ago

Can you provide some more details on how the "dynamic quantization" works in PyTorch? If it is just converting the weights to 8-bit floating point numbers, then the memory reduction factor will be at most x2.

meakbiyik commented 1 year ago

@ggerganov for FP16, yup (about 2x). For FP32, 4x. As far as I understand, your implementation switches between the two, so the benefit might be slightly more than 2x? The up-to-date PyTorch documentation is here.
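
A rough back-of-the-envelope sketch of where those factors come from, ignoring the small per-tensor scale/zero-point overhead that dynamic quantization adds:

```python
# Sketch: storage of a single 1024x1024 Linear weight in FP32, FP16,
# and int8 (as used by dynamic quantization). Overheads are ignored.
import torch

weight = torch.nn.Linear(1024, 1024, bias=False).weight

fp32_bytes = weight.numel() * 4  # 32-bit float
fp16_bytes = weight.numel() * 2  # 16-bit float
int8_bytes = weight.numel() * 1  # 8-bit int (+ small scale/zero-point)

print(f"FP32: {fp32_bytes / 2**20:.1f} MiB")  # ~4.0 MiB
print(f"FP16: {fp16_bytes / 2**20:.1f} MiB")  # ~2.0 MiB -> 2x smaller
print(f"INT8: {int8_bytes / 2**20:.1f} MiB")  # ~1.0 MiB -> 4x vs FP32, 2x vs FP16
```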

The speedup does not necessarily track the memory reduction, but it might be similar. I imagine the gain would be more noticeable for non-ARM users without FP16 vector arithmetic. The danger here is loss of accuracy; I am not sure how robust the Whisper models are to this, or whether the above repo does anything to remedy it (here are PyTorch's recommendations on this). Overall, small models should not be quantized AFAIK, but the relatively large models might benefit immensely.

Great work BTW, loving this repo :)

ggerganov commented 1 year ago

Yes - there are some tensors from the model that are currently FP32 instead of FP16, because it was easier to first implement the operations in FP32 mode. See this comment for more information: https://github.com/ggerganov/whisper.cpp/issues/132#issuecomment-1311891779

At some point we should convert all tensors of the model to FP16 - this is what the original model uses, so it should be stable. But I am not really worried about it for now, because I don't expect a big performance benefit: what remains to be converted is mostly 1-dimensional bias tensors, which are very small.
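
To illustrate the kind of rule being discussed (not the repo's actual conversion code), here is a hypothetical sketch of an export step that keeps small 1-D tensors in FP32 while storing 2-D weight matrices as FP16; `export_dtype` is an invented helper name.

```python
# Hypothetical sketch: choose the on-disk dtype per tensor when exporting
# to ggml. Not the actual logic of models/convert-pt-to-ggml.py.
import numpy as np
import torch

def export_dtype(tensor: torch.Tensor, use_f16: bool = True) -> np.ndarray:
    data = tensor.detach().cpu().numpy()
    if use_f16 and data.ndim == 2:
        return data.astype(np.float16)  # weight matrices: convert to FP16
    return data.astype(np.float32)      # 1-D biases etc.: tiny, keep FP32
```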