NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Failure running an INT8 model with TensorRT 8.4.12 when running YOLO on Orin DLA #3799

Open mayulin0206 opened 5 months ago

mayulin0206 commented 5 months ago

Description

For the quantized INT8 model, the inference results are correct under Orin DLA FP16, and the results are also correct under Orin GPU INT8, but the results are completely incorrect under Orin DLA INT8.

Environment

TensorRT Version: 8.4.12

NVIDIA GPU:

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

zerollzeng commented 4 months ago
  1. Could you please try the latest DriveOS/JetPack release?
  2. We have a YOLOv5 DLA sample at https://github.com/NVIDIA-AI-IOT/cuDLA-samples; it may be helpful to you.
  3. Please provide a minimal reproduction if the latest release still fails.
lix19937 commented 4 months ago

For the quantized INT8 model, the inference results are correct under Orin DLA FP16, and the results are also correct under Orin GPU INT8, but the results are completely incorrect under Orin DLA INT8.

You should do a QAT-to-PTQ conversion: extract the scales from the QAT ONNX model, save them as a calibration table, and use that table to run INT8 on DLA.
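For reference, a minimal sketch of that QAT-to-PTQ step, assuming a Q/DQ-quantized ONNX model whose scale tensors live in the graph initializers (the file names and the header version string are illustrative, not from this thread):

```python
# qat2ptq.py -- minimal sketch of the QAT-to-PTQ step described above.
# Assumes a Q/DQ-quantized ONNX (e.g. exported via pytorch-quantization);
# file names and the cache header string are placeholders.
import struct

import onnx
from onnx import numpy_helper

model = onnx.load("yolov5_qat.onnx")  # placeholder path
inits = {i.name: i for i in model.graph.initializer}

scales = {}
for node in model.graph.node:
    if node.op_type == "QuantizeLinear" and node.input[1] in inits:
        # input[0] is the tensor being quantized, input[1] is its scale;
        # activation scales are per-tensor, so take the scalar value.
        scale = numpy_helper.to_array(inits[node.input[1]]).reshape(-1)[0]
        scales[node.input[0]] = float(scale)

with open("qat2ptq.cache", "w") as f:
    # TensorRT calibration caches begin with a version/algorithm header;
    # each entry is "tensor_name: <big-endian hex of the float32 scale>".
    f.write("TRT-8401-EntropyCalibration2\n")
    for name, s in scales.items():
        f.write(f"{name}: {struct.pack('>f', s).hex()}\n")
```

The resulting file can then be handed to trtexec via --calib, as the trtexec command later in this thread does with data/model/qat2ptq.cache.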

mayulin0206 commented 4 months ago

For the quantized INT8 model, the inference results are correct under Orin DLA FP16, and the results are also correct under Orin GPU INT8, but the results are completely incorrect under Orin DLA INT8.

You should do a QAT-to-PTQ conversion: extract the scales from the QAT ONNX model, save them as a calibration table, and use that table to run INT8 on DLA.

Yes, I did this, but the result is still completely wrong. The inference results are correct under Orin GPU INT8, but they are completely incorrect under Orin DLA INT8.

mayulin0206 commented 4 months ago
  1. Could you please try the latest DriveOS/JetPack release?
  2. We have a YOLOv5 DLA sample at https://github.com/NVIDIA-AI-IOT/cuDLA-samples; it may be helpful to you.
  3. Please provide a minimal reproduction if the latest release still fails.

@zerollzeng Following your advice, I ran the YOLOv5 DLA sample (https://github.com/NVIDIA-AI-IOT/cuDLA-samples) on the Orin DLA, but encountered the issue shown below.

make run

```
/usr/local/cuda//bin/nvcc -I /usr/local/cuda//include -I ./src/matx_reformat/ -I /usr/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include -gencode arch=compute_87,code=sm_87 -c -o build/decode_nms.o src/decode_nms.cu
g++ -I /usr/local/cuda//include -I ./src/matx_reformat/ -I /usr/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -O2 -c -o build/validate_coco.o src/validate_coco.cpp
g++ -I /usr/local/cuda//include -I ./src/matx_reformat/ -I /usr/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -O2 -c -o build/yolov5.o src/yolov5.cpp
g++ -I /usr/local/cuda//include -I ./src/matx_reformat/ -I /usr/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -O2 -c -o build/cudla_context_hybrid.o src/cudla_context_hybrid.cpp
g++ --std=c++14 -Wno-deprecated-declarations -Wall -O2 -I /usr/local/cuda//include -I ./src/matx_reformat/ -I /usr/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include -o ./build/cudla_yolov5_app build/decode_nms.o build/validate_coco.o build/yolov5.o build/cudla_context_hybrid.o -l cudla -L/usr/local/cuda//lib64 -l cuda -l cudart -l nvinfer -L /usr/lib/aarch64-linux-gnu/ -l opencv_objdetect -l opencv_highgui -l opencv_imgproc -l opencv_core -l opencv_imgcodecs -L ./src/matx_reformat/build/ -l matx_reformat -l jsoncpp -lnvscibuf -lnvscisync
././build/cudla_yolov5_app --engine ./data/loadable/yolov5.int8.int8hwc4in.fp16chw16out.standalone.bin --image ./data/images/image.jpg --backend cudla_int8
[hybrid mode] create cuDLA device SUCCESS
[hybrid mode] load cuDLA module from memory FAILED in src/cudla_context_hybrid.cpp:96, CUDLA ERR: 7
make: *** [Makefile:80: run] Error 1
```

Build INT8 and FP16 loadable from ONNX in this project

bash data/model/build_dla_standalone_loadable.sh

```
[04/22/2024-19:51:27] [E] Error[3]: [builderConfig.cpp::setFlag::65] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/builderConfig.cpp::setFlag::65, condition: builderFlag != BuilderFlag::kPREFER_PRECISION_CONSTRAINTS || !flags[BuilderFlag::kOBEY_PRECISION_CONSTRAINTS]. kPREFER_PRECISION_CONSTRAINTS cannot be set if kOBEY_PRECISION_CONSTRAINTS is set.)
[04/22/2024-19:51:27] [E] Error[2]: [nvmRegionOptimizer.cpp::forceToUseNvmIO::175] Error Code 2: Internal Error (Assertion std::all_of(a->consumers.begin(), a->consumers.end(), [](Node* n) { return isDLA(n->backend); }) failed.)
[04/22/2024-19:51:27] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed.)
[04/22/2024-19:51:27] [E] Engine could not be created from network
[04/22/2024-19:51:27] [E] Building engine failed
[04/22/2024-19:51:27] [E] Failed to create engine from model or file.
[04/22/2024-19:51:27] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --minShapes=images:1x3x672x672 --maxShapes=images:1x3x672x672 --optShapes=images:1x3x672x672 --shapes=images:1x3x672x672 --onnx=data/model/yolov5_trimmed_qat.onnx --useDLACore=0 --buildDLAStandalone --saveEngine=data/loadable/yolov5.int8.int8hwc4in.fp16chw16out.standalone.bin --inputIOFormats=int8:dla_hwc4 --outputIOFormats=fp16:chw16 --int8 --fp16 --calib=data/model/qat2ptq.cache --precisionConstraints=obey --layerPrecisions=/model.24/m.0/Conv:fp16,/model.24/m.1/Conv:fp16,/model.24/m.2/Conv:fp16
```
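The first error is a builder-flag conflict: kOBEY_PRECISION_CONSTRAINTS and kPREFER_PRECISION_CONSTRAINTS are mutually exclusive, and this combination of build script and trtexec version appears to end up setting both. A minimal sketch of the same constraint through the TensorRT Python API (the INT8/FP16/DLA settings mirror the trtexec command above; everything else is illustrative):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Mirrors the trtexec command above: INT8 + FP16 on DLA core 0.
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0

# Precision constraints: OBEY and PREFER are mutually exclusive.
# Setting both is exactly the "API Usage Error" in the log above.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
# config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)  # would fail
```

The second assertion (forceToUseNvmIO) is internal, but it fires while forcing network I/O into DLA-accessible memory, which suggests part of the graph did not stay on DLA even though --buildDLAStandalone requires the whole network to run there.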

mayulin0206 commented 4 months ago

@zerollzeng @lix19937 I also have a few more questions about DLA. Under DLA INT8 mode (see the API sketch after this list):

  1. Is the default tensor format for computation kDLA_HWC4?
  2. Since the tensor format for computation on my GPU is kLINEAR, is a format conversion necessary under DLA INT8 mode?
  3. If the default tensor format under DLA INT8 mode is kDLA_HWC4 and some layers in the model fall back to the GPU, will the computations that fall back be reformatted automatically, i.e. converted back to kLINEAR?
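For reference, a minimal sketch of the builder knobs these questions touch, assuming the standard TensorRT Python API (the input name/shape and the format choice mirror the sample's trtexec flags; the rest is illustrative). TensorRT chooses tensor formats per region at build time and inserts reformat layers at DLA/GPU boundaries on its own; allowed_formats only pins the layout of network I/O tensors:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Placeholder input with the sample's shape; a real network would be
# populated here, e.g. via trt.OnnxParser.
inp = network.add_input("images", trt.float32, (1, 3, 672, 672))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
# Layers DLA cannot run fall back to the GPU; TensorRT itself inserts
# whatever reformat layers are needed between the DLA and GPU regions.
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

# Pin the network input to DLA's vectorized INT8 layout (kDLA_HWC4),
# matching the sample's --inputIOFormats=int8:dla_hwc4.
inp.dtype = trt.int8
inp.allowed_formats = 1 << int(trt.TensorFormat.DLA_HWC4)
```

Whether an extra conversion appears at runtime then depends on which layout the adjacent DLA or GPU region picked at build time, not on a fixed default.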