NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

The KL divergence calculation is very slow and is not optimized for acceleration #4023

Open yychen2000 opened 1 month ago

yychen2000 commented 1 month ago

Description

I used the following file directly, tools/pytorch-quantization/pytorch_quantization/calib/histogram.py, and called HistogramCalibrator.compute_amax() to compute the maximum value at which activations should be clipped for quantization. However, when this step is added to the network during inference, inference becomes abnormally slow. Is there a corresponding acceleration optimization?
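For reference, a minimal sketch of the workflow I am describing (assuming the public calib API from tools/pytorch-quantization; the random tensors stand in for real activation batches, and constructor arguments may differ slightly between versions):

```python
import torch
from pytorch_quantization import calib

# Build a histogram calibrator from the pytorch-quantization calib module.
calibrator = calib.HistogramCalibrator(num_bits=8, axis=None, unsigned=False)

# Feed a few batches of activations (random data stands in for real calibration batches).
for _ in range(8):
    calibrator.collect(torch.randn(32, 256))

# "entropy" selects the KL-divergence-based clipping threshold discussed in this issue;
# this call is the slow, NumPy/CPU-bound step.
amax = calibrator.compute_amax("entropy")
print(amax)
```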


lix19937 commented 1 month ago

The compute_amax function uses NumPy, so it runs on the CPU. If you want it to run fast, you can rewrite it with torch (using the torch CUDA backend). BTW, you can track https://github.com/NVIDIA/TensorRT-Model-Optimizer
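A minimal sketch of that direction (an illustration, not the library's implementation): keep the absolute-value histogram in torch so the counting stays on the GPU instead of dropping to NumPy.

```python
import torch

def collect_histogram_gpu(x, num_bins=2048, max_val=None):
    """Accumulate an |x| histogram on whatever device x lives on (CPU or CUDA)."""
    x_abs = x.detach().abs().float()
    if max_val is None:
        max_val = x_abs.max().item()
    # torch.histc works on CUDA tensors, so the heavy counting runs on the GPU.
    hist = torch.histc(x_abs, bins=num_bins, min=0.0, max=max_val)
    return hist, max_val

# Usage: hist, amax = collect_histogram_gpu(activation.cuda())
```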

yychen2000 commented 1 month ago

The compute_amax function uses NumPy, so it runs on the CPU. If you want it to run fast, you can rewrite it with torch (using the torch CUDA backend). BTW, you can track https://github.com/NVIDIA/TensorRT-Model-Optimizer

I looked at the block of source code that computes the histogram data, and it seems to use a binary search inside a loop. Can the torch CUDA backend you mentioned be called as a generic tensor-computation replacement, or do we need a custom CUDA operator to solve this problem? Thank you for the link. TensorRT Model Optimizer seems to focus on accelerating LLM inference; I see optimization methods such as SmoothQuant and AWQ, but acceleration for CNN networks does not seem to be mentioned. The KL-divergence method for clipping the activation range described in the NVIDIA documentation should be a general calculation, but performing this calibration during the inference stage makes the calibration process extremely slow. Would the simple optimization of scaling down the calibration dataset have too much impact on the quantization quality?

lix19937 commented 1 month ago

Do we need a custom CUDA operator to solve this problem?

Yes, you can write a CUDA kernel (layer) to solve this.
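If a full custom kernel is more than you need, the divergence step itself can also be expressed with torch tensor ops so it runs on the GPU. A minimal, illustrative sketch (not the calibrator's actual code):

```python
import torch

def kl_divergence(p_hist, q_hist, eps=1e-12):
    """KL(P || Q) between two unnormalized histograms, on CPU or CUDA."""
    p = p_hist.float() / (p_hist.sum() + eps)
    q = q_hist.float() / (q_hist.sum() + eps)
    return torch.sum(p * torch.log((p + eps) / (q + eps)))
```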

performing this calibration during the inference stage makes the calibration process extremely slow. Would the simple optimization of scaling down the calibration dataset have too much impact on the quantization quality?

Calibration is done once, offline. The calibration dataset needs to be representative and class-balanced, usually 500~1000 samples.
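As an illustration of "representative and class-balanced", a small hypothetical helper for picking roughly equal numbers of samples per class (the names here are placeholders, not part of pytorch-quantization):

```python
import random
from collections import defaultdict

def balanced_calibration_indices(labels, num_samples=512, seed=0):
    """Pick dataset indices so each class contributes about the same number of samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    per_class = max(1, num_samples // len(by_class))
    picked = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        picked.extend(idxs[:per_class])
    return picked[:num_samples]

# Usage: indices = balanced_calibration_indices(dataset_labels, num_samples=512)
```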

BTW:
Pytorch Quantization development has transitioned to the TensorRT Model Optimizer

https://github.com/NVIDIA/TensorRT/tree/release/10.2/tools/pytorch-quantization

Note: Pytorch Quantization development has transitioned to the TensorRT Model Optimizer. All developers are encouraged to use the TensorRT Model Optimizer to benefit from the latest advancements on quantization and compression. While the Pytorch Quantization code will remain available, it will no longer receive further development.
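For migration, the Model Optimizer examples show a PTQ flow roughly like the sketch below; the config name and the quantize() signature are taken from those examples, so verify them against the installed modelopt version.

```python
import torch
import modelopt.torch.quantization as mtq

# A tiny stand-in model and random calibration data, just to show the call shape.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
calib_data = [torch.randn(8, 16) for _ in range(16)]

def calibrate_loop(m):
    # Forward passes over calibration batches; Model Optimizer records ranges here.
    with torch.no_grad():
        for batch in calib_data:
            m(batch)

# INT8_DEFAULT_CFG and this quantize() signature follow the Model Optimizer examples;
# check the modelopt documentation for your version.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=calibrate_loop)
```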