Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

How to perform inference for YOLO-NAS-POSE with TensorRT and display the detections? #1691

Open vladb99 opened 10 months ago

vladb99 commented 10 months ago

💡 Your Question

I've successfully exported YOLO-NAS-POSE-N to an .onnx model and then built a .trt engine from it. The .onnx was exported with FP32 quantization. I'm now trying to run inference on an image with the .trt engine. The code I'm testing is below; it is based on the code from issue #1451. After the inference, I pass the predictions to PoseVisualization.draw_poses. When I pass predictions obtained with another backend such as OpenVINO, PoseVisualization.draw_poses draws them just fine. With the TensorRT backend, however, the method fails; the error is also below. In addition, when I print num_predictions from the inference I get [[1056964608]], which is obviously wrong. Why does the inference return such wrong values?
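
(As a side note, 1056964608 happens to be exactly the int32 reinterpretation of the float32 value 0.5, which makes me suspect a dtype mismatch somewhere when reading the output bindings. A minimal check in NumPy:)

import numpy as np

# Reading the bytes of float32 0.5 back as int32 gives 1056964608,
# the same bogus value that num_predictions reports
print(np.array([0.5], dtype=np.float32).view(np.int32))  # -> [1056964608]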

error                                     Traceback (most recent call last)
Cell In[6], line 2
      1 init_tensorrt()
----> 2 detect()

Cell In[4], line 70, in detect()
     66 nms_joints = bindings['graph2_post_nms_joints'].data.cpu().numpy()
     68 results = [num_predictions, nms_boxes, nms_scores, nms_joints]
---> 70 img = get_predictions_from_batch_format(image, results)
     71 img = Image.fromarray(img, mode='RGB')
     72 img.save('myimage.png')

Cell In[3], line 16, in get_predictions_from_batch_format(image, predictions)
     12 def get_predictions_from_batch_format(image, predictions):
     13     # In this tutorial we are using batch size of 1, therefore we are getting only first element of the predictions
     14     image_index, pred_boxes, pred_scores, pred_joints = next(iter(iterate_over_batch_predictions(predictions, 1)))
---> 16     image = PoseVisualization.draw_poses(
     17         image=image, poses=pred_joints, scores=pred_scores, boxes=pred_boxes,
     18         edge_links=None, edge_colors=None, keypoint_colors=None, is_crowd=None
     19     )
     21     return image

File d:\git\Yolov7\yoloNAS-pose\ENV\lib\site-packages\super_gradients\training\utils\visualization\pose_estimation.py:195, in PoseVisualization.draw_poses(self, image, poses, boxes, scores, is_crowd, edge_links, edge_colors, keypoint_colors, show_keypoint_confidence, joint_thickness, box_thickness, keypoint_radius, keypoint_confidence_threshold)
    192         if is_crowd is not None:
...
>  - Can't parse 'pt1'. Sequence item with index 0 has a wrong type
>  - Can't parse 'pt1'. Sequence item with index 0 has a wrong type
>  - Can't parse 'rec'. Expected sequence length 4, got 2
>  - Can't parse 'rec'. Expected sequence length 4, got 2

The export code:

model = models.get(Models.YOLO_NAS_POSE_N, pretrained_weights="coco_pose")
export_result = model.export("yolo_nas_pose_n_fp32.onnx")
export_result

Converting the .onnx model to a .trt engine:

D:\TensorRT-8.6.1.6.Windows10.x86_64.cuda-11.8\TensorRT-8.6.1.6\bin\trtexec --onnx=yolo_nas_pose_n_fp32.onnx  --saveEngine=yolo_nas_pose_n_fp32_engine.trt

When I convert to a TensorRT engine I get:

[12/07/2023-13:59:05] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.

I thought that when I export my model the weights would be in FP32 format, so I am confused.
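
For what it's worth, this is how I checked which tensors in the exported .onnx are actually stored as INT64 (a small sketch using the onnx package; the warning appears to come from these, not from the FP32 weights):

import onnx

model = onnx.load("yolo_nas_pose_n_fp32.onnx")
for initializer in model.graph.initializer:
    dtype_name = onnx.TensorProto.DataType.Name(initializer.data_type)
    if dtype_name == "INT64":
        # These tensors trigger the TensorRT cast-down warning;
        # the convolution weights themselves stay FLOAT (FP32)
        print(initializer.name, dtype_name)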

The inference code:

import tensorrt as trt
import numpy as np
import os
import torch
import pycuda.driver as cuda
import pycuda.autoinit
import matplotlib.pyplot as plt
from PIL import Image
import cv2
from collections import namedtuple, OrderedDict
import time
from super_gradients.training.utils.visualization.pose_estimation import PoseVisualization

def iterate_over_batch_predictions(predictions, batch_size):
    num_detections, batch_boxes, batch_scores, batch_joints = predictions
    for image_index in range(batch_size):
        num_detection_in_image = num_detections[image_index, 0]

        pred_scores = batch_scores[image_index, :num_detection_in_image]
        pred_boxes = batch_boxes[image_index, :num_detection_in_image]
        pred_joints = batch_joints[image_index, :num_detection_in_image].reshape((len(pred_scores), -1, 3))

        yield image_index, pred_boxes, pred_scores, pred_joints

def get_predictions_from_batch_format(image, predictions):
    # We use a batch size of 1 in this example, so we take only the first element of the predictions
    image_index, pred_boxes, pred_scores, pred_joints = next(iter(iterate_over_batch_predictions(predictions, 1)))

    image = PoseVisualization.draw_poses(
        image=image, poses=pred_joints, scores=pred_scores, boxes=pred_boxes,
        edge_links=None, edge_colors=None, keypoint_colors=None, is_crowd=None
    )

    return image

w = 'yolo_nas_pose_n_fp32_engine.trt'
device = torch.device('cuda:0')

bindings = None
binding_addrs = None
context = None

# Infer TensorRT Engine
def init_tensorrt():
    global bindings
    global binding_addrs
    global context

    Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
    logger = trt.Logger(trt.Logger.INFO)
    trt.init_libnvinfer_plugins(logger, namespace="")
    with open(w, 'rb') as f, trt.Runtime(logger) as runtime:
        model = runtime.deserialize_cuda_engine(f.read())
    bindings = OrderedDict()
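    # Pre-allocate a device tensor for every engine binding; execute_v2 will
    # read inputs from and write outputs to these device pointers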
    for index in range(model.num_bindings):
        name = model.get_tensor_name(index)
        dtype = trt.nptype(model.get_tensor_dtype(name))
        shape = tuple(model.get_tensor_shape(name))
        data = torch.from_numpy(np.empty(shape, dtype=np.dtype(dtype))).to(device)
        bindings[name] = Binding(name, dtype, shape, data, int(data.data_ptr()))
    binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
    context = model.create_execution_context()

    # Warm up the engine with 10 dummy inferences
    for _ in range(10):
        tmp = torch.randn(1,3,640,640).to(device)
        binding_addrs['onnx::Cast_0'] = int(tmp.data_ptr())
        context.execute_v2(list(binding_addrs.values()))

def detect():

    global bindings
    global binding_addrs
    global context
    global videocap

    #img = videocap.read_in()
    img = Image.open("test.png")
    img = np.asarray(img)[:, :, :3]

    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    image = img.copy()
    image = image.transpose((2, 0, 1))
    image = np.expand_dims(image, 0)

    im = image.astype(np.uint8)
    im = torch.from_numpy(im).to(device)

    start = time.perf_counter()
    binding_addrs['onnx::Cast_0'] = int(im.data_ptr())
    context.execute_v2(list(binding_addrs.values()))
    exec_cost = time.perf_counter()-start

    num_predictions = bindings['graph2_num_predictions'].data.cpu().numpy()
    nms_boxes = bindings['graph2_post_nms_boxes'].data.cpu().numpy()
    nms_scores = bindings['graph2_post_nms_scores'].data.cpu().numpy()
    nms_joints = bindings['graph2_post_nms_joints'].data.cpu().numpy()

    results = [num_predictions, nms_boxes, nms_scores, nms_joints]

    img = get_predictions_from_batch_format(image, results)
    img = Image.fromarray(img, mode='RGB')
    img.save('myimage.png')

init_tensorrt()
detect()
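
To rule out a dtype or shape mismatch on the bindings themselves, a small debugging sketch against the bindings dict built above (run after init_tensorrt()):

for name, binding in bindings.items():
    # Print the dtype and shape that TensorRT reports for each binding;
    # num_predictions should report an integer dtype, the NMS outputs float32
    print(name, binding.dtype, binding.shape)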

Versions

Collecting environment information...
PyTorch version: 2.1.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Enterprise
GCC version: (x86_64-posix-seh, Built by strawberryperl.com project) 8.3.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: N/A

Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Quadro P2000
Nvidia driver version: 537.42
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2592
DeviceID=CPU0
Family=198
L2CacheSize=1536
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2592
Name=Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] onnx==1.13.0
[pip3] onnx-graphsurgeon==0.3.27
[pip3] onnx-simplifier==0.4.35
[pip3] onnxruntime==1.13.1
[pip3] torch==2.1.1+cu118
[pip3] torchaudio==2.1.1+cu118
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.16.1+cu118
[conda] Could not collect

BloodAxe commented 10 months ago

Hi @vladb99! What SG version do you have?

[12/07/2023-13:59:05] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.

This most likely comes from an arange operator that outputs int64 values when converting box regression offsets to absolute pixel units, not from the weights themselves. I believe this has been addressed recently.

  • Can't parse 'pt1'. Sequence item with index 0 has a wrong type
  • Can't parse 'pt1'. Sequence item with index 0 has a wrong type
  • Can't parse 'rec'. Expected sequence length 4, got 2
  • Can't parse 'rec'. Expected sequence length 4, got 2

These errors indicate that pt1 and pt2 have the wrong dtype. OpenCV can only draw rectangles when the coordinates are given as (int, int), so you may want to cast them explicitly to int. You are probably using an old release of SG in which PoseVisualization.draw_poses hasn't been updated to do this casting for you.
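
If upgrading is not immediately an option, a rough workaround (untested sketch) is to cast the arrays before calling draw_poses; note the caveat about the keypoint confidence column in the comments:

# OpenCV wants integer pixel coordinates, so round and cast before drawing
pred_boxes = pred_boxes.round().astype(int)
pred_joints = pred_joints.round().astype(int)  # caveat: this also rounds the confidence column to 0 or 1

image = PoseVisualization.draw_poses(
    image=image, poses=pred_joints, scores=pred_scores, boxes=pred_boxes,
    edge_links=None, edge_colors=None, keypoint_colors=None, is_crowd=None,
)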

So my suggestion is to take the latest build of SG and try with it.

Anyway, we are going to make an official Colab demo showing how to run our models using TRT. Stay tuned for that.

vladb99 commented 10 months ago

What SG version do you have?

super-gradients 3.5.0

These errors indicate that pt1 and pt2 have the wrong dtype. OpenCV can only draw rectangles when the coordinates are given as (int, int), so you may want to cast them explicitly to int. You are probably using an old release of SG in which PoseVisualization.draw_poses hasn't been updated to do this casting for you.

I don't think this is the case. I exported an .onnx model using your tutorial and then ran inference with ONNXRuntime and OpenVINO. Their outputs were identical. When running inference with the TensorRT engine built from that same exported .onnx model, I get something totally different, whereas I would have expected similar results.
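
For reference, the ONNXRuntime check looked roughly like this (sketch; im is the same (1, 3, 640, 640) uint8 NumPy array used in my inference code above):

import onnxruntime as ort

session = ort.InferenceSession("yolo_nas_pose_n_fp32.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name  # 'onnx::Cast_0' in this export
num_predictions, nms_boxes, nms_scores, nms_joints = session.run(None, {input_name: im})
print(num_predictions)  # a plausible detection count here, unlike the TensorRT output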

Anyway, we are going to make an official Colab demo showing how to run our models using TRT. Stay tuned for that.

That would be great! Please don't just show running inference with the TRT engine, but also use PoseVisualization.draw_poses to draw the poses. I'm asking because I can run the inference as well, but the output is just wrong, and PoseVisualization.draw_poses fails for exactly that reason.