ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Support for gpt2 quantization #52

Open kobzaond opened 2 years ago

kobzaond commented 2 years ago

I tried to quantize (i.e., insert QDQ layers into) the gpt2 model:

```python
import torch

# assumption: import path as used in the transformer-deploy examples
from transformer_deploy.QDQModels.calibration_utils import QATCalibrate

# (excerpt from a method; self holds the model, tokenizer, and calibration data)
batch_size = 8
with QATCalibrate(method="histogram", percentile=99.999) as qat:
    model_q = self.model.cuda()
    qat.setup_model_qat(model_q)  # prepare quantizer to any model

    # feed calibration data through the model so the quantizers can
    # collect activation statistics
    with torch.no_grad():
        for start_index in range(0, 650, batch_size):
            end_index = start_index + batch_size
            data = self.data[start_index:end_index]
            data = self.tokenizer(
                data, return_tensors="pt", padding=True, truncation=True, max_length=512
            )
            input_torch = {
                k: torch.tensor(v, dtype=torch.long, device="cuda")
                for k, v in data.items()
                if k in ["input_ids", "attention_mask", "token_type_ids"]
            }
            model_q(**input_torch)
```

but no QDQ layers were inserted. I assume you don't support GPT-2 yet; do you plan to add it?

pommedeterresautee commented 2 years ago

Indeed, we have not done it yet, but it should be fairly simple.

You can call `patch_model` (https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/patch.py#L44); for an example of a simple patch module, see https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/QDQAlbert.py.
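For GPT-2 specifically, the parts that need manual treatment are the attention matmuls, plus transformers' `Conv1D` layers, which are not `nn.Linear` and are therefore not covered by the generic layer patching. Below is a minimal sketch of the QDQ pattern around a matmul, written directly against NVIDIA's pytorch-quantization (which this library builds on); the names here are illustrative, not our actual API:

```python
# A minimal sketch (not transformer-deploy's actual API) of the QDQ pattern
# inserted around an attention matmul, using pytorch-quantization directly.
import torch
from pytorch_quantization.nn import TensorQuantizer
from pytorch_quantization.tensor_quant import QuantDescriptor

desc = QuantDescriptor(num_bits=8, calib_method="histogram")
quantizer_a = TensorQuantizer(desc)  # quantizes the first matmul operand
quantizer_b = TensorQuantizer(desc)  # quantizes the second matmul operand

def qdq_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # fake-quantize both operands so TensorRT can fuse an INT8 matmul
    return torch.matmul(quantizer_a(a), quantizer_b(b))
```

When the resulting Q/DQ pattern is one TensorRT supports, the matmul and its surrounding quantize/dequantize nodes are fused into a single INT8 kernel at engine build time.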

Let me know if it's clear for you.

kobzaond commented 2 years ago

Thank you for your response. I tried to write QDQGPT2.py following the same pattern as QDQBert.py or QDQElectra.py, and added the new patch module to the list in https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/patch.py#L44.

I was not able to fully understand how the quantization works, though: I see that you insert the QDQ layers, but I got lost in the code. In any case, I then tried to quantize the GPT-2 model, which worked, except that certain layers end up with an amax value of nan, e.g.:

```
(11): GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D()
    (c_proj): Conv1D()
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
    (matmul_quantizer_0): TensorQuantizer(8bit fake per-tensor amax=5.6953 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_1): TensorQuantizer(8bit fake per-tensor amax=5.9871 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_2): TensorQuantizer(8bit fake per-tensor amax=0.9995 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_3): TensorQuantizer(8bit fake per-tensor amax=13.3477 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_4): TensorQuantizer(8bit fake per-tensor amax=nan calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_5): TensorQuantizer(8bit fake per-tensor amax=nan calibrator=HistogramCalibrator scale=1.0 quant)
```
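(For reference, a quick sketch of how the offending quantizers can be listed; `model_q` is the calibrated model from the snippet above:)

```python
import torch
from pytorch_quantization.nn import TensorQuantizer

# walk the module tree and report every quantizer with a nan amax
for name, module in model_q.named_modules():
    if isinstance(module, TensorQuantizer) and module.amax is not None:
        if torch.isnan(module.amax).any():
            print(f"nan amax in: {name}")
```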

Then I converted the model to ONNX and to a TensorRT engine, and both conversions worked. However, in TensorRT the quantized model is slower than with FP32 precision. Do you have any idea why it is so slow?

pommedeterresautee commented 2 years ago

Have you built the engine with int8 support?

kobzaond commented 2 years ago

Yes, I've set both the FP16 and INT8 builder flags:

```python
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
```
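The rest of the build follows the usual TensorRT 8 ONNX-parser flow (a sketch under that assumption; the file name is illustrative, not my exact code):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# parse the QDQ ONNX export (illustrative file name)
with open("gpt2-qdq.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
serialized_engine = builder.build_serialized_network(network, config)
```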

Basically I used code analogous to your quantization demo; only the model changed. Here are some of my measurements (in seconds; each value is an average over 20 runs, always on the same sample):


| configuration | latency (s) |
| --- | --- |
| TensorRT FP16, batch 1 | 0.0052 |
| TensorRT FP16, batch 8 | 0.058 |
| TensorRT INT8, batch 1 | 0.016 |
| TensorRT INT8, batch 8 | 0.124 |

pommedeterresautee commented 2 years ago

Have you checked that your local TensorRT version is the same as the one in the Docker image you use?

kobzaond commented 2 years ago

I am not using any Docker image; everything is installed either in a Python virtual environment, a conda environment, or locally, so there shouldn't be any version mismatch.
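For completeness, here is how the versions in the active environment can be checked (a small sketch; I believe pytorch-quantization exposes `__version__` like most packages, but treat that as an assumption):

```python
import torch
import tensorrt as trt
import pytorch_quantization  # NVIDIA's quantization toolkit

print("torch:", torch.__version__)
print("tensorrt:", trt.__version__)
print("pytorch-quantization:", pytorch_quantization.__version__)  # assumed attribute
```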