NolanoOrg / cformers

SoTA Transformers with C-backend for fast inference on your CPU.

Simplifying the quantization pipeline #9

Open kamalojasv181 opened 1 year ago

kamalojasv181 commented 1 year ago

The quantization pipeline seems very hard to use. Besides manually adding support for popular models, I think it would be a good idea if we could further automate the quantization pipeline.

As far as I can tell, we only need a dict mapping the layer names in the original model to the layer names in the converted ggml model, and then the same script can handle any architecture.
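
Something like this is what I have in mind (a rough sketch only; the layer names and the `write_ggml_tensor` helper are illustrative placeholders, not real keys):

```python
import torch

# Per-architecture map from original checkpoint keys to converted names.
# "{i}" is expanded once per transformer layer. All names here are illustrative.
NAME_MAP = {
    "transformer.wte.weight": "tok_embeddings.weight",
    "transformer.h.{i}.attn.q_proj.weight": "layers.{i}.attention.wq.weight",
    "transformer.h.{i}.mlp.fc_in.weight": "layers.{i}.mlp.fc_in.weight",
    # ... one entry per (repeated) parameter
}

def iter_mapped_tensors(state_dict, n_layers):
    """Yield (converted_name, fp32 tensor) pairs in a fixed order."""
    for src_pat, dst_pat in NAME_MAP.items():
        layer_ids = range(n_layers) if "{i}" in src_pat else [None]
        for i in layer_ids:
            src = src_pat if i is None else src_pat.format(i=i)
            dst = dst_pat if i is None else dst_pat.format(i=i)
            yield dst, state_dict[src].to(torch.float32)

# Usage (hypothetical paths / helper):
# sd = torch.load("pytorch_model.bin", map_location="cpu")
# for name, tensor in iter_mapped_tensors(sd, n_layers=28):
#     write_ggml_tensor(fout, name, tensor)  # existing serialization code
```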

Thoughts @tejasvaidhyadev @Ayushk4 ?

Ayushk4 commented 1 year ago

This is true. A lot of the code needs refactoring. We need to make it easy to add new models.

The good thing is that once we add support for any major model (like GPT-J), it becomes very easy to add support for its derivatives (like GPT-JT).

I would greatly welcome suggestions on how we can improve on this.

A2va commented 1 year ago

I'm wondering if it's possible to do the whole quantization process in the Python conversion script. I feel like that would be much simpler than a two-step process with two different programs.
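
Roughly what I imagine (just a sketch; `write_tensor` stands in for whatever serialization the conversion script already does, and the int8 quantizer is only a placeholder for the real 4-bit format):

```python
import numpy as np
import torch

def quantize_int8(w: np.ndarray):
    """Placeholder symmetric int8 quantization (the real format would use 4-bit blocks)."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return scale, q

def convert_and_quantize(state_dict):
    """Single pass: map, quantize and emit every tensor -- no separate C++ step."""
    for name, tensor in state_dict.items():
        w = tensor.to(torch.float32).numpy()
        if w.ndim == 2:                      # quantize the big matrices
            scale, q = quantize_int8(w)
            # write_tensor(fout, name, dtype="int8", data=q, scale=scale)
        else:                                # keep biases / norms in fp32
            pass
            # write_tensor(fout, name, dtype="f32", data=w)
```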

Ayushk4 commented 1 year ago

That's a good suggestion @A2va .

Do you have any suggestions on how we can quantize and save in a fast manner in python?

kamalojasv181 commented 1 year ago

I don't think we need Python for that. We already have all the weights saved in the ggml model; we just need the computation graph. ONNX does this by doing a forward pass and saving a static graph. We could potentially do something like that, or perhaps start with ONNX itself.
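
For reference, capturing the static graph with ONNX is just a traced forward pass, something like this (the model and shapes are hypothetical):

```python
import torch

model = load_model()  # hypothetical: any torch.nn.Module we already support
dummy_ids = torch.randint(0, 50257, (1, 8))  # example batch of token ids

# Runs one forward pass and records the resulting static graph to disk.
torch.onnx.export(
    model,
    (dummy_ids,),
    "model.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
)
```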

Ayushk4 commented 1 year ago

ONNX is general purpose - ggml does not support all of its operations (slicing, for example). If we go that route, we will have to add support for reading the ONNX computation graph, map it to a GGML computation graph, and write graph-specific rules to substitute for the missing operations. If it is worth it in the long run, we could do it, but it would take a long time to get something tangible - even a minimal viable prototype that converts the ONNX computation graph to GGML for a single model.
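
To make the scale of that concrete, even the first step - walking the ONNX graph and mapping op types to ggml ops - looks roughly like this (the op table is illustrative and very incomplete):

```python
import onnx

# Illustrative, very incomplete table from ONNX op types to ggml ops.
ONNX_TO_GGML = {
    "MatMul": "ggml_mul_mat",
    "Add": "ggml_add",
    "Softmax": "ggml_soft_max",
    "Mul": "ggml_mul",
}

model = onnx.load("model.onnx")
unsupported = set()
for node in model.graph.node:
    if node.op_type not in ONNX_TO_GGML:
        # e.g. Slice, Gather, Reshape variants -> need graph-rewrite rules
        unsupported.add(node.op_type)
    # else: emit the corresponding ggml node wired up via node.input / node.output

print("ops needing substitution rules:", sorted(unsupported))
```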

A2va commented 1 year ago

> Do you have any suggestions on how we can quantize and save in a fast manner in python?

Not directly, but I found this script in llama.cpp which takes an already quantized PyTorch model and converts it to a ggml model.

Quantization of LLaMA and OPT models in Python: https://github.com/qwopqwop200/GPTQ-for-LLaMa

I have no idea if this is fast, but it's certainly slower than the C++ version. I didn't realize until I read the README that GPTQ and plain Int4 quantization are different. So which of those methods do the C++ programs use to quantize?
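
From what I understand, plain Int4 quantization (what the C++ quantize tool appears to do) is just round-to-nearest on small blocks of weights, whereas GPTQ additionally uses calibration data to pick quantized values that minimize the layer's output error. Something like this for the round-to-nearest side (the block size is illustrative):

```python
import numpy as np

def round_to_nearest_q4(w: np.ndarray, block: int = 32):
    """Round-to-nearest 4-bit quantization per block of weights (no calibration data)."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0.0] = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return scale.astype(np.float32), q

# GPTQ, by contrast, quantizes one layer at a time and chooses the quantized
# weights to minimize || X @ W - X @ W_q || over a set of calibration inputs X,
# so it needs sample data and a solver, not just the checkpoint.
```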