@ttyio ^ ^
@aidevmin , we support QAT with multi-GPU, but for calibration we suggest running on a single GPU with a small batch, then broadcasting the calibrated model to the other GPUs.
For the performance issue, we are going to have a release with GPU acceleration for the quantization kernels next month. Thanks!
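A minimal sketch of that workflow, assuming the pytorch-quantization calibration API (quant_modules / quant_nn / TensorQuantizer); the toy model, batch counts, and shapes below are placeholders, not the real network:

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()  # monkey-patch torch.nn layers with quantized equivalents

# Toy stand-in for the real network; after initialize(), nn.Conv2d becomes QuantConv2d.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 3, padding=1),
).cuda()

# 1) Calibrate on a single GPU with a handful of small batches.
for _, m in model.named_modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()

with torch.no_grad():
    for _ in range(8):  # a few small batches of (stand-in) calibration data
        model(torch.randn(4, 3, 64, 64).cuda())

for _, m in model.named_modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.load_calib_amax()  # default max calibrator; histogram calibrators take a method argument
        m.enable_quant()
        m.disable_calib()

# 2) Broadcast the calibrated model to all GPUs for QAT fine-tuning,
#    e.g. one process per GPU with DistributedDataParallel:
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

Since the quantizer ranges are already stored in the model state after step 1, the fine-tuning step can then run data-parallel across GPUs.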
@ttyio I agree with you. I checked the speed of the QAT engine; it is much slower than TRT PTQ (I checked both TRT 8.6 and TRT 8.5).
Hi @aidevmin , I mean the calibration in the pytorch quantization tool, not TRT itself. And the gap between PTQ and QAT mostly comes from the Q/DQ placement. Here is the doc on the placement recommendations: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs
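To see where the Q/DQ nodes actually land, one option (a sketch, not from this thread) is to export the quantized model to ONNX and inspect the QuantizeLinear/DequantizeLinear nodes around Conv/Add/Pool layers, e.g. in Netron. The toy model, shapes, and file name below are placeholders; use_fb_fake_quant is the pytorch-quantization switch for ONNX-style fake quant on export:

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
).cuda()

# Quick calibration pass so every quantizer has an amax before export.
for _, m in model.named_modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()
with torch.no_grad():
    model(torch.randn(4, 3, 64, 64).cuda())
for _, m in model.named_modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.load_calib_amax()
        m.enable_quant()
        m.disable_calib()

model.eval()
quant_nn.TensorQuantizer.use_fb_fake_quant = True  # emit ONNX QuantizeLinear/DequantizeLinear on export
torch.onnx.export(model, torch.randn(1, 3, 64, 64).cuda(),
                  "model_qat.onnx", opset_version=13)  # opset 13+ for per-channel Q/DQ
quant_nn.TensorQuantizer.use_fb_fake_quant = False
```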
@ttyio Thanks. I have one more question. Do we have to fuse Batch Normalization before QAT?
Did you check the two cases, with and without BN fusing before QAT? Does BN fusing affect the performance (speed) of the final engine model?
@aidevmin ,
Could you elaborate? BN is fused into Conv, the pattern looks like DQ -> Conv -> BN -> ...
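As a rough illustration of that pattern (a sketch, assuming pytorch-quantization's QuantConv2d; the block name and shapes are made up): the input quantizer on the Conv is what becomes the Q/DQ pair in front of it, the BatchNorm stays as a separate module in PyTorch, and TensorRT fuses Conv + BN when building the engine, so no manual fusion is needed before QAT.

```python
import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn

class QuantConvBNReLU(nn.Module):
    """Illustrative Conv-BN-ReLU block; quantization sits only on the Conv input/weight."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        # QuantConv2d fake-quantizes its input and weight -> Q/DQ -> Conv after ONNX export
        self.conv = quant_nn.QuantConv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)  # left unfused here; TensorRT fuses it into the Conv at build time
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

block = QuantConvBNReLU(3, 32)
y = block(torch.randn(1, 3, 64, 64))  # uncalibrated quantizers fall back to dynamic ranges
```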
@ttyio Thanks.
The 2.1.3 release (https://github.com/NVIDIA/TensorRT/tree/release/8.6/tools/pytorch-quantization) already includes the optimization in the QAT tool; we will also update the wheels on PyPI in the next monthly release.
Closing this issue, thanks!
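Until the PyPI wheels are updated, a quick (illustrative) way to check which pytorch-quantization build is installed, assuming the usual distribution name on PyPI/NGC:

```python
from importlib.metadata import version

# Prints the installed pytorch-quantization version; 2.1.3 is the release
# mentioned above that includes the QAT tool optimization.
print(version("pytorch-quantization"))
```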
@ttyio Thanks.
I followed this guide https://github.com/NVIDIA-AI-IOT/yolo_deepstream/tree/main/yolov7_qat
I tried to do QAT with multiple GPUs with torch.nn.DataParallel, but I got an error. The batch_size for QAT is low, so I want to run with multiple GPUs.
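For reference, a minimal sketch of the setup described (a toy model stands in for YOLOv7; since the error itself isn't quoted here, this doesn't claim to reproduce or fix it):

```python
import torch
from pytorch_quantization import quant_modules

quant_modules.initialize()  # swap nn layers for quantized ones before building the model
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
).cuda()

# Replicate across all visible GPUs; each replica sees batch_size / num_gpus samples,
# which is why a larger effective batch needs multiple GPUs.
model = torch.nn.DataParallel(model)
out = model(torch.randn(8, 3, 64, 64).cuda())
```

As suggested above, calibration can still be done on a single GPU first; DistributedDataParallel (one process per GPU) is the usual alternative when DataParallel runs into trouble.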