Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

TensorRT inference result is abnormal #1700

Closed hslee4716 closed 11 months ago

hslee4716 commented 11 months ago

💡 Your Question

I successfully trained a yolo_nas_m model on a custom dataset with QAT and exported it to INT8 ONNX. The ONNX inference results are normal, but when I convert the ONNX to a TensorRT engine, the results are abnormal (nothing is detected).

Here is my code; is there anything wrong with it?

from super_gradients.common.object_names import Models
from super_gradients.training import models
from super_gradients.conversion import DetectionOutputFormatMode, ExportQuantizationMode, ExportTargetBackend
from super_gradients.training.utils.quantization.selective_quantization_utils import SelectiveQuantizer
from super_gradients.modules.repvgg_block import fuse_repvgg_blocks_residual_branches
from super_gradients.common.environment.cfg_utils import load_recipe
from super_gradients.training.utils import get_param

import torch

def load_checkpoint(model, ckpt_file):
    checkpoint = torch.load(ckpt_file, map_location="cpu")
    ckpt_key = "ema_net" if "ema_net" in checkpoint else "net"
    state_dict = checkpoint[ckpt_key]
    model.load_state_dict(state_dict)

def to_int8_model(model):
    fuse_repvgg_blocks_residual_branches(model)        
    quantization_params = load_recipe("quantization_params/default_quantization_params").quantization_params

    selective_quantizer_params = get_param(quantization_params, "selective_quantizer_params")
    q_util = SelectiveQuantizer(
        default_quant_modules_calibrator_weights=get_param(selective_quantizer_params, "calibrator_w"),
        default_quant_modules_calibrator_inputs=get_param(selective_quantizer_params, "calibrator_i"),
        default_per_channel_quant_weights=get_param(selective_quantizer_params, "per_channel"),
        default_learn_amax=get_param(selective_quantizer_params, "learn_amax"),
    )
    q_util.register_skip_quantization(layer_names=get_param(selective_quantizer_params, "skip_modules"))
    q_util.quantize_module(model)

model = models.get(
    Models.YOLO_NAS_M, 
    num_classes=2,
    checkpoint_num_classes=2,
    pretrained_weights="coco"
)
model.eval()

to_int8_model(model)

load_checkpoint(model, "exp/yolo_nas_M_custom_qat_1212/RUN_20231211_121606_029174/ckpt_best.pth")

model.export("./yolonas_m_1212_test.onnx", 
             engine=ExportTargetBackend.TENSORRT, # onnxruntime
             output_predictions_format=DetectionOutputFormatMode.FLAT_FORMAT, 
             postprocessing=False,
             preprocessing=True,
             quantization_mode=ExportQuantizationMode.INT8
             )
$ trtexec --fp16 --onnx=./yolonas_m_1212_test.onnx --saveEngine=./yolonas_m_1212_test.engine

Versions

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.9
Libc version: glibc-2.31

Python version: 3.9.18 (main, Sep 11 2023, 13:41:44)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-89-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.4.315
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.5
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             16
On-line CPU(s) list:                0-15
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              165
Model name:                         Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
Stepping:                           5
CPU MHz:                            3800.000
CPU max MHz:                        5100.0000
CPU min MHz:                        800.0000
BogoMIPS:                           7599.80
Virtualization:                     VT-x
L1d cache:                          256 KiB
L1i cache:                          256 KiB
L2 cache:                           2 MiB
L3 cache:                           16 MiB
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] onnx==1.13.0
[pip3] onnx-graphsurgeon==0.3.27
[pip3] onnx-simplifier==0.4.35
[pip3] onnxruntime-gpu==1.16.3
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.15.2
[pip3] triton==2.0.0
[conda] numpy                     1.23.0                   pypi_0    pypi
[conda] pytorch-quantization      2.1.3                    pypi_0    pypi
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torchaudio                2.0.2                    pypi_0    pypi
[conda] torchmetrics              0.8.0                    pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi

tensorboard                   2.15.1
tensorboard-data-server       0.7.2
tensorrt                      8.6.1
tensorrt-bindings             8.6.1
tensorrt-libs                 8.6.1
super-gradients               3.6.0rc32009        [local]
BloodAxe commented 11 months ago

Hi @hslee4716. I think the code here is wrong; specifically, the order of these calls seems to be incorrect:

to_int8_model(model)
load_checkpoint(model, "exp/yolo_nas_M_custom_qat_1212/RUN_20231211_121606_029174/ckpt_best.pth")

You cannot load regular torch weights into a quantized model; it messes up the model completely. There is no simple way to save a quantized model's state as you normally would. After you have quantized a model, there is only one way to store it: export it to an ONNX file.
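
If the checkpoint held regular FP32 weights, the fix would be to swap those two calls so the weights are loaded before quantization. A minimal sketch, reusing load_checkpoint and to_int8_model from your snippet (the checkpoint path here is just a placeholder):

model = models.get(Models.YOLO_NAS_M, num_classes=2, checkpoint_num_classes=2, pretrained_weights="coco")
model.eval()

load_checkpoint(model, "path/to/fp32_ckpt_best.pth")  # load FP32 weights first
to_int8_model(model)                                  # only then quantize the model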

Please check the example notebook where we show how to fine-tune and export a YoloNAS model end-to-end: https://github.com/Deci-AI/super-gradients/blob/master/notebooks/yolo_nas_custom_dataset_fine_tuning_with_qat.ipynb

hslee4716 commented 11 months ago

Thanks for your reply @BloodAxe. However, when I run QAT for the YoloNAS model with the code below, only quantized weights are saved. Therefore, I can only load the weights after the model has been quantized.

from super_gradients.training.datasets.detection_datasets.coco_format_detection import COCOFormatDetectionDataset
from super_gradients.training.transforms.transforms import (
    DetectionMosaic,
    DetectionRandomAffine,
    DetectionHSV,
    DetectionHorizontalFlip,
    DetectionPaddedRescale,
    DetectionStandardize,
    DetectionTargetsFormatTransform,
)
from super_gradients.training.datasets.datasets_utils import worker_init_reset_seed

from super_gradients.training import Trainer
from super_gradients.common.object_names import Models
from super_gradients.training import models

from torch.utils.data import DataLoader
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback
from super_gradients.training.utils.detection_utils import DetectionCollateFN
from super_gradients.training.pre_launch_callbacks import modify_params_for_qat

import warnings
warnings.filterwarnings('ignore')

input_size = (1280,1280)
batch_size = 2
num_workers = 12
train_dataset_params = dict(
    data_dir="/datasets/ver5",
    images_dir="/datasets/ver5/images/train",
    json_annotation_file="/datasets/ver5/annotations/train.json",
    input_dim=input_size,
    ignore_empty_annotations=False,
    with_crowd=False,
    all_classes_list=['person', 'car'],
    transforms=[
        DetectionMosaic(prob=1., input_dim=input_size),
        DetectionRandomAffine(degrees=0.0, scales=(0.5, 1.5), shear=0.0, target_size=input_size, filter_box_candidates=False, border_value=128),
        DetectionHSV(prob=1.0, hgain=5, vgain=30, sgain=30),
        DetectionHorizontalFlip(prob=0.5),
        DetectionPaddedRescale(input_dim=input_size),
        DetectionStandardize(max_value=255),
        DetectionTargetsFormatTransform(input_dim=input_size, output_format="LABEL_CXCYWH"),
    ],
)

val_dataset_params = dict(
    data_dir="/datasets/ver5",
    images_dir="/datasets/ver_5/images/val",
    json_annotation_file="/datasets/ver5/annotations/val.json",
    input_dim=input_size,
    ignore_empty_annotations=False,
    with_crowd=False,
    all_classes_list=['person', 'car'],
    transforms=[
        DetectionPaddedRescale(input_dim=input_size, max_targets=300),
        DetectionStandardize(max_value=255),
        DetectionTargetsFormatTransform(input_dim=input_size, output_format="LABEL_CXCYWH"),
    ],
)

train_dataloader_params = {
    "shuffle": True,
    "batch_size": batch_size,
    "drop_last": True,
    "pin_memory": True,
    "collate_fn": DetectionCollateFN(),
    "worker_init_fn": worker_init_reset_seed,
    "num_workers": num_workers,
    "persistent_workers": True,
}

val_dataloader_params = {
    "shuffle": False,
    "batch_size": batch_size,
    "drop_last": False,
    "pin_memory": True,
    "collate_fn": DetectionCollateFN(),
    "worker_init_fn": worker_init_reset_seed,
    "num_workers": num_workers,
    "persistent_workers": True,
}

train_params = {
    "warmup_initial_lr": 1e-6,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "AdamW",
    "zero_weight_decay_on_bias_and_bn": True,
    "lr_warmup_epochs": 3,
    "warmup_mode": "LinearEpochLRWarmup",
    "optimizer_params": {"weight_decay": 0.0001},
    "ema": True,
    "ema_params": {"beta": 25, "decay_type": "exp"},
    "max_epochs": 300,
    "mixed_precision": True,
    "loss": PPYoloELoss(use_static_assigner=False, num_classes=2, reg_max=16),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=2,
            normalize_targets=True,
            include_classwise_ap=True,
            class_names=['person', 'car'],
            post_prediction_callback=PPYoloEPostPredictionCallback(score_threshold=0.01, nms_top_k=1000, max_predictions=300, nms_threshold=0.7),
        )
        ],
    "metric_to_watch": "mAP@0.50",

}

train_params, train_dataset_params, val_dataset_params, train_dataloader_params, val_dataloader_params = modify_params_for_qat(
    train_params, train_dataset_params, val_dataset_params, train_dataloader_params, val_dataloader_params
)

trainset = COCOFormatDetectionDataset(**train_dataset_params)
valset = COCOFormatDetectionDataset(**val_dataset_params)

train_loader = DataLoader(trainset, **train_dataloader_params)
valid_loader = DataLoader(valset, **val_dataloader_params)

trainer = Trainer(experiment_name="yolo_nas_M_custom_qat", ckpt_root_dir="experiments")
model = models.get(Models.YOLO_NAS_M, num_classes=2, pretrained_weights="coco")
model.cuda()

trainer.qat(model=model, training_params=train_params, 
            train_loader=train_loader, valid_loader=valid_loader, calib_loader=train_loader)

When I try to load the model as in the reference code,

model = models.get(
    Models.YOLO_NAS_M, 
    num_classes=2,
    checkpoint_num_classes=2,
    checkpoint_path="weights/best.pth"
)

The error below occurs

/mnt/nas/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 212, in __call__
    raise ValueError(f"ckpt layer {ckpt_key} with shape {ckpt_val.shape} does not match {model_key}" f" with shape {model_val.shape} in the model")
ValueError: ckpt layer backbone.stem.conv.post_bn.weight with shape torch.Size([48]) does not match backbone.stem.conv.branch_3x3.conv.weight with shape torch.Size([48, 3, 3, 3]) in the model

And the keys of the loaded weights are as follows.

backbone.stem.conv.post_bn.weight
backbone.stem.conv.post_bn.bias
backbone.stem.conv.post_bn.running_mean
backbone.stem.conv.post_bn.running_var
backbone.stem.conv.post_bn.num_batches_tracked
backbone.stem.conv.rbr_reparam.weight
backbone.stem.conv.rbr_reparam.bias
backbone.stem.conv.rbr_reparam._input_quantizer._amax
backbone.stem.conv.rbr_reparam._weight_quantizer._amax
backbone.stage1.downsample.post_bn.weight
backbone.stage1.downsample.post_bn.bias
backbone.stage1.downsample.post_bn.running_mean
backbone.stage1.downsample.post_bn.running_var
backbone.stage1.downsample.post_bn.num_batches_tracked
backbone.stage1.downsample.rbr_reparam.weight
backbone.stage1.downsample.rbr_reparam.bias
backbone.stage1.downsample.rbr_reparam._input_quantizer._amax
.
.
.
.
heads.head3.reg_convs.0.seq.conv._weight_quantizer._amax
heads.head3.reg_convs.0.seq.bn.weight
heads.head3.reg_convs.0.seq.bn.bias
heads.head3.reg_convs.0.seq.bn.running_mean
heads.head3.reg_convs.0.seq.bn.running_var
heads.head3.reg_convs.0.seq.bn.num_batches_tracked
heads.head3.cls_pred.weight
heads.head3.cls_pred.bias
heads.head3.cls_pred._input_quantizer._amax
heads.head3.cls_pred._weight_quantizer._amax
heads.head3.reg_pred.weight
heads.head3.reg_pred.bias
heads.head3.reg_pred._input_quantizer._amax
heads.head3.reg_pred._weight_quantizer._amax
BloodAxe commented 11 months ago

Sorry for the confusion; there are too many methods, and not all of them have been updated to the recent model.export API.

Let's start with the first option, model.export itself.

This is the new and recommended way to export a model to ONNX, and it gives you the most control over how the model is exported. In particular, you can specify that you want to export an INT8-quantized model, in which case you don't have to do PTQ manually. A model.export(..., quantization_mode=ExportQuantizationMode.INT8) call is enough: you just pass the regular trained model with FP32 weights and it gets quantized internally. You don't get QAT here, but you can still pass a dataloader for model calibration if you like.

So in short:

trainer.train(model=my_model, ...)
my_model.export("my_model.onnx", quantization_mode=ExportQuantizationMode.INT8, postprocessing=True)

Please note that my_model.export will NOT modify the my_model instance itself. The instance stays intact, with all the weights and model state kept as they were before the call. This is done on purpose to avoid unwanted side effects.
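
For reference, here is a minimal sketch of passing a calibration dataloader to model.export. The calibration_loader parameter name is an assumption on my side; check the export API of your super-gradients version:

# Sketch only: calibration_loader is assumed to be the parameter name;
# valid_loader is any representative dataloader of preprocessed images.
my_model.export(
    "my_model_int8.onnx",
    quantization_mode=ExportQuantizationMode.INT8,
    calibration_loader=valid_loader,  # samples used to calibrate activation ranges
    postprocessing=True,
)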

2) The second option is to use trainer.ptq. This method does a similar job: it quantizes and exports a model. If the model supports the new export API it will use it; otherwise it exports the model using the old export routines (relevant for classification & segmentation models). This method will change the model state. It saves the exported ONNX file to the experiment directory and computes metrics for the PTQ model on the validation dataloader. You don't get much control over the exported ONNX with this method: you can't specify the export format (BATCH/FLAT) or tune the postprocessing.
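
A rough sketch of that call, with parameter names assumed to mirror trainer.qat from the script above (the exact signature may differ, so check the Trainer API of your version):

# Sketch only: parameter names are assumptions, not taken verbatim from the docs.
trainer = Trainer(experiment_name="yolo_nas_M_custom_ptq", ckpt_root_dir="experiments")
model = models.get(Models.YOLO_NAS_M, num_classes=2, pretrained_weights="coco")

trainer.ptq(
    model=model,                # FP32 model; its state is modified in place
    calib_loader=train_loader,  # dataloader used for calibration
    valid_loader=valid_loader,  # PTQ metrics are computed on this loader
)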

3) trainer.qat performs PTQ (post-training quantization), QAT (quantization-aware training), AND model export to ONNX. This method currently doesn't support the new model.export API, so the exported model in this case will not have a postprocessing step. It also changes the model state, but it lets you 'tune' the model and slightly increase the accuracy of the exported model.

After calling trainer.ptq or trainer.qat you should be able to call my_model.export("my_model.onnx", quantization_mode=ExportQuantizationMode.INT8, postprocessing=True) and set whatever postprocessing you like. Just don't try to load any weights at that point; after QAT the model already has the right state for export.
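
Putting it together for this issue, a minimal sketch of the flow described above, using the variables from the QAT script earlier in the thread (the output filename is just an example):

# QAT: PTQ plus quantization-aware fine-tuning; modifies `model` in place.
trainer.qat(model=model, training_params=train_params,
            train_loader=train_loader, valid_loader=valid_loader, calib_loader=train_loader)

# The model instance now holds the quantized state, so export it directly;
# do not load any checkpoint into it first.
model.export("yolo_nas_m_qat_int8.onnx",
             quantization_mode=ExportQuantizationMode.INT8,
             postprocessing=True)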