NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Virtually no detections after detectron2->onnx->onnx->TensorRT conversion of Mask RCNN. #3792

Closed Huxwell closed 4 months ago

Huxwell commented 7 months ago

Description

I have converted detectron2/configs/Base-RCNN-FPN.yaml with weights model_final_f10217.pkl to ONNX using https://github.com/facebookresearch/detectron2/blob/main/tools/deploy/export_model.py. The resulting model works well (is not blind) when run with onnxruntime.

Then I ran https://github.com/NVIDIA/TensorRT/blob/release/10.0/samples/python/detectron2/create_onnx.py. The resulting ONNX can no longer be run with onnxruntime; it fails with an "Unsupported model IR version: 10, max supported IR version: 9" error. I failed to convert it to ir_version 9 (is IR version 10 not yet published by onnx?).
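(For what it's worth, the IR version can be inspected and rewritten with the onnx Python API; a minimal sketch, with a placeholder filename. Note this only patches the header field, so it can still fail if the graph genuinely uses IR-10 features, and the TRT plugin ops will keep onnxruntime from loading the model anyway, as noted further down the thread.)

import onnx

model = onnx.load("converted.onnx")  # placeholder filename
print(model.ir_version)              # 10 for the failing model
print([(o.domain, o.version) for o in model.opset_import])

model.ir_version = 9                 # rewrites the header field only
onnx.checker.check_model(model)      # may reject the downgrade
onnx.save(model, "converted_ir9.onnx")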

Then I managed to successfully convert the model to TRT using nvcr.io/nvidia/pytorch:22.08-py3 with TensorRT 8.4.2-1:

trtexec --onnx=mask_rcnn_R_50_FPN_3x_f10217.onnx --saveEngine=mask_rcnn_R_50_FPN_3x_f10217r.trt --useCudaGraph

The resulting model improves latency from 4 fps in detectron2 (T600 GPU) to 7 fps (fp32) or 10 fps on average (fp16) using https://github.com/NVIDIA/TensorRT/blob/release/10.0/samples/python/detectron2/infer.py (timing only the inference itself; visualization and post-processing seem very slow).
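(A minimal sketch of the kind of measurement meant by "timing only the inference itself"; a hypothetical helper of mine, not the sample's own timing code:)

import time

def measure_fps(infer_fn, batch, warmup=10, iters=100):
    # Warm-up excludes CUDA context creation and allocations from the timing.
    for _ in range(warmup):
        infer_fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        infer_fn(batch)
    return iters / (time.perf_counter() - start)

# usage (hypothetical): fps = measure_fps(trt_model.infer, preprocessed_image)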

But I can see almost no detections with the TRT engine (same image I used for conversion, 1344x1344, fp32): [TRT output screenshot in original issue]. The same 1344x1344 image before, with detectron2: [detectron2 output screenshot in original issue]

Environment

TensorRT Version: 8.4.2-1

NVIDIA GPU: T600

NVIDIA Driver Version: 535.161.07

CUDA Version: 12.2

CUDNN Version: n/a

Operating System:

Python Version (if applicable): 3.8.10 (host), 3.8.13 (docker)

PyTorch Version (if applicable): 2.0.1

Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:22.08-py3 , PC (for now)

@azhurkevich, can you please advise where to look?

azhurkevich commented 7 months ago

@Huxwell it is most likely a problem of the graph changing in the detectron2 exporter. I would suggest reverting detectron2 back 1+ year and trying again. It is not easily possible to write a dynamic pattern matcher for the plugins used in this sample to replace ONNX graph sections with them. As a result, if the graph changes (and it is changing), the naive pattern matcher I have breaks. This forces you to go and inspect the ONNX graphs manually with netron and figure out what went wrong.

I am not maintaining this sample anymore; I'd love to, but I just have no time... @zerollzeng can you talk to some folks internally and see if we can write some automatic testing and verification that the sample works? Otherwise folks get upset, rightfully so.

P.S. The sample was never intended to work with ONNX Runtime due to the presence of TRT plugins. As a result, you have to build with trtexec or the TRT Python APIs and run it with TRT, not ONNX Runtime.
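For completeness, a minimal sketch of that TRT Python API path (assuming TensorRT 8.x; filenames are placeholders). The key step for this sample is registering the built-in plugin library before parsing, so the plugin ops inserted by create_onnx.py resolve:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")  # register built-in TRT plugins (NMS, ROIAlign, ...)

builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("converted.onnx", "rb") as f:  # the create_onnx.py output
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)
# config.set_flag(trt.BuilderFlag.FP16)  # optional
with open("model.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))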

zerollzeng commented 7 months ago

@zerollzeng can you talk to some folks internally and see if we can write some automatic testing and verification that the sample works? Otherwise folks get upset, rightfully so.

Checking internally.

zerollzeng commented 7 months ago

@Huxwell Have you run the sample with the latest TRT 10? I want to know whether it works in our latest release; if not, I'll create an internal bug to track it. Thanks!

Huxwell commented 7 months ago

@azhurkevich just to be sure, you are suggesting reverting detectron2 1+ year back (not onnx or TensorRT)? @zerollzeng can I do it with a docker image provided by NVIDIA? https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags Currently I am using nvcr.io/nvidia/pytorch:22.08-py3; would moving to nvcr.io/nvidia/pytorch:24.03-py3 mean TRT 10, or should I attempt to build from source?

Also, @azhurkevich, this is slightly off-topic: our ultimate goal is to achieve good latency and accuracy on Jetson Orin for a modified Keypoint R-CNN from detectron2, as well as for regular object detection.

The Mask R-CNN conversion above functions mostly as a learning step / a way to estimate performance for us. Can you recommend alternative approaches to our current path (export_model.py -> create_onnx.py -> trtexec)? Do you know of an alternative conversion path, or another (non-detectron2) keypoints/pose algorithm that works well on Jetson Orin / TensorRT?

azhurkevich commented 7 months ago

@Huxwell yeah, basically reverting to some old version; you can check our images and what they contain here. I do not think you need to investigate further if you make that work (there have been no major advancements in keypoint detection, although some researchers might disagree with me ;) ). I gave some tips to another person who was asking a similar question about making a keypoint network work; you can search through the issues and dig it up. Make sure to reduce the number of keypoints to the amount you actually need, otherwise there is massive wasted computation.

About an alternative path: nothing off the top of my head from when I was doing investigations 2 years ago. Detectron2 Mask R-CNN was the fastest available, and considering a lot of components are reused, I would imagine Keypoint R-CNN is also decent. Maybe there have been further perf improvements elsewhere since then, but I would say stick to detectron2 to conserve your time. You will still have to do this ONNX graph surgery with plugins. I know it is painful, but custom plugins are what give you perf; the create_onnx.py script is a great resource for understanding how things are done. I've answered and provided a lot of info related to this sample in various issues; you might want to search and read them. I think TRT will not work without plugins if you just feed the exported ONNX to TRT. I would encourage you to try though!

azhurkevich commented 7 months ago

@Huxwell the critical component to revert is detectron2, because that is what most likely leads to these accuracy issues. Also, maybe stick to the builder I have in the sample instead of trtexec. As soon as you get your plugin-patched ONNX, feel free to use any TRT version (TRT 10, for example).

zerollzeng commented 7 months ago

@Huxwell here are 2 things you might have missed:

  1. You converted model_final_f10217.pkl with detectron2/configs/Base-RCNN-FPN.yaml. In the model zoo, the correct config to use is detectron2/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml, as given in MODEL_ZOO.md.
  2. In our sample, we have a special modification to the detectron2 export model script, as shown in the readme; did you follow it?

I ask because the sample passes in our local tests. Thanks!
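For reference, a minimal sketch of pairing that checkpoint with its matching config via detectron2's model_zoo helpers (get_config_file / get_checkpoint_url are standard detectron2 APIs; the config path is the one named in point 1):

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
print(cfg.MODEL.WEIGHTS)  # resolves to the model_final_f10217.pkl checkpoint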

FilipDrapejkowskiGL commented 7 months ago

Thank you for helping me out @zerollzeng. Exactly as you suggested, I was missing the 1344x1344 input 'augmentation' modification to the detectron2 export model script. Now even fp16 gives decent predictions. [screenshot in original issue] Please don't close this issue just yet, as I will compare the predicted probabilities in more detail, but I am on a good path.
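For anyone else hitting the same blindness, the modification in question is the one from the sample README; paraphrased from memory (verify against samples/python/detectron2/README.md), it pins the test-time resize in detectron2's tools/deploy/export_model.py to a static 1344x1344 input:

import detectron2.data.transforms as T  # already imported in export_model.py

# Original (dynamic resize, produces a graph the sample's converter does not expect):
# aug = T.ResizeShortestEdge(
#     [cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST], cfg.INPUT.MAX_SIZE_TEST
# )
# Replacement (static 1344x1344 input):
aug = T.ResizeShortestEdge([1344, 1344], 1344)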

Thanks a lot @azhurkevich,

I know it is painful, but custom plugins are what give you perf; the create_onnx.py script is a great resource for understanding how things are done.

Any chance you could elaborate on this sentence? Do custom plugins serve only to make trtexec able to parse the ONNX, or do they also reduce latency? Is there any other source on these customizations, other than create_onnx.py (which I will study in more detail), that you can recommend?

azhurkevich commented 7 months ago

@FilipDrapejkowskiGL Plug-ins are custom kernels that speed up latency significantly. Small tip: for an actual production use case, you most likely care about high-confidence objects only. A higher NMS score threshold should get you more perf; it's one of the arguments in the sample.

The second role plug-ins play is enabling TRT to run this particular network at all. If you try to build an engine from the plain ONNX the exporter provides, TRT simply will not be able to build the network. This is related to the bloat introduced into the ONNX by doing multiple intermediate-representation conversions (you can take a look at torch's exporter code). Specifically, the exporter produces if/else conditionals inside the ONNX graph whose sole purpose is to introduce dynamic ranking of some tensors. In simple terms, the ONNX graph is overcomplicated for no good reason, which puts a lot of pressure on TRT's compiler, which has a hard time in this case.
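One way to see that bloat concretely is to dump an op-type histogram of the exported graph and look for the If conditionals; a small sketch with the onnx package (the filename is a placeholder):

import onnx
from collections import Counter

model = onnx.load("exported_mask_rcnn.onnx")  # placeholder filename
print(Counter(n.op_type for n in model.graph.node).most_common(15))

# The If nodes are the dynamic-rank conditionals described above.
for n in model.graph.node:
    if n.op_type == "If":
        print(n.name or "<unnamed If>")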

In general, I always recommend that people learn to write custom kernels. You can start by writing a simple GEMM on a GPU with CUDA (start small with a triple nested loop and work your way up to shared memory usage, wave quantization, tiling, etc.). Eventually, get to CUTLASS. Low-level GPU optimizations make a big difference. I wish you a lot of luck!)
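As a first step in that direction, here is the naive one-thread-per-output GEMM, written with Numba's CUDA JIT purely so the sketch stays in Python (raw CUDA C++, shared-memory tiling, and CUTLASS are the natural next steps):

import numpy as np
from numba import cuda

@cuda.jit
def naive_gemm(A, B, C):
    i, j = cuda.grid(2)                     # one thread per output element
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):         # the innermost loop of the triple nest
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

A = np.random.rand(512, 512).astype(np.float32)
B = np.random.rand(512, 512).astype(np.float32)
C = np.zeros((512, 512), dtype=np.float32)
naive_gemm[(32, 32), (16, 16)](A, B, C)     # 32*16 = 512 threads per axis; Numba handles the host/device copies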

Huxwell commented 6 months ago

For reference, onnx_graphsurgeon fails to load my detectron2 keypoints model (which now works perfectly in onnxruntime) with the un-googlable:

onnx_graphsurgeon/importers/onnx_importer.py", line 81, in get_onnx_tensor_dtype
    if onnx_dtype in onnx.mapping.TENSOR_TYPE_TO_NP_TYPE:
TypeError: unhashable type: 'Opaque'

However, after replacing the onnx_graphsurgeon version cloned from git with a simple pip install onnx_graphsurgeon==0.5.2, the problem no longer occurs. Posting this just so that people don't spend a few hours in the debugger, as I did, if it happens to them.

Huxwell commented 5 months ago

@azhurkevich if possible, I'd be really grateful for some advice from you. I have more or less ported Detectron2 Keypoint R-CNN to TensorRT (I don't even have problems with a resnet18 backbone or a custom keypoints schema anymore) by modifying your Mask R-CNN porting code. My model predicts heatmaps with TensorRT, and I compute keypoints in postprocessing using numpy (it's a small enough operation that it doesn't visibly influence performance).

However, my heatmaps/keypoints are only correct if there are only a few objects in my test image. When I have more than 1-5 objects, only some objects have correct predictions. This is most likely due to a wrong ordering of heatmaps: I have 100 bboxes and 100 heatmaps (which are 28x28 and have to be rescaled and repositioned w.r.t. their respective bounding boxes), and these 100 objects are probably in a different order. It seems the roi_heads() version that I use is imperfect:

def roi_heads(rpn_outputs, p2, p3, p4, p5, second_nms_threshold):
    """
    Updates the graph to replace all ROIAlign Caffe ops with one single pyramid ROIAlign.
    Eliminates CollectRpnProposals, DistributeFpnProposals, and BatchPermutation nodes that are not supported by TensorRT.
    Connects pyramid ROIAlign to box_head and connects box_head to final box head outputs in the form of second NMS.
    :param rpn_outputs: Outputs of the first NMS/proposal generator.
    :param p2: Output of p2 feature map, required for ROIAlign operation.
    :param p3: Output of p3 feature map, required for ROIAlign operation.
    :param p4: Output of p4 feature map, required for ROIAlign operation.
    :param p5: Output of p5 feature map, required for ROIAlign operation.
    :param second_nms_threshold: Override the 2nd NMS score threshold value. If set to None, use the value in the graph.
    """
    # Create ROIAlign node for bounding boxes.
    box_pooler_output = self.ROIAlign(
        rpn_outputs[1], p2, p3, p4, p5,
        self.first_ROIAlign_pooled_size,
        self.first_ROIAlign_sampling_ratio,
        self.first_ROIAlign_type,
        self.first_NMS_max_proposals,
        'box_pooler'
    )

    # Reshape node for ROIAlign/box pooler output.
    box_pooler_shape = np.asarray(
        [-1, self.fpn_out_channels * self.first_ROIAlign_pooled_size * self.first_ROIAlign_pooled_size],
        dtype=np.int64
    )
    box_pooler_reshape = self.graph.op_with_const("Reshape", "box_pooler/reshape", box_pooler_output, box_pooler_shape)

    # Get the first Gemm op of box head and connect the box pooler to it.
    first_box_head_gemm = self.graph.find_node_by_op_name("Gemm", "/roi_heads/box_head/fc1/Gemm")
    first_box_head_gemm.inputs[0] = box_pooler_reshape[0]

    # Get final two nodes of the box predictor.
    cls_score = self.graph.find_node_by_op_name("Softmax", "/roi_heads/Softmax")
    bbox_pred = self.graph.find_node_by_op_name("Gemm", "/roi_heads/box_predictor/bbox_pred/Gemm")

    # Linear transformation to convert box coordinates from (TopLeft, BottomRight) Corner encoding
    # to CenterSize encoding.
    matmul_const = np.matrix(
        '0.5 0 -1 0; 0 0.5 0 -1; 0.5 0 1 0; 0 0.5 0 1',
        dtype=np.float32
    )
    matmul_out = self.graph.matmul("RPN_NMS/detection_boxes_conversion", rpn_outputs[1], matmul_const)

    # Reshape node for bbox_pred, preparing for scaling and second NMS.
    bbox_pred_shape = np.asarray(
        [self.batch_size, self.first_NMS_max_proposals, self.num_classes, 4],
        dtype=np.int64
    )
    bbox_pred_reshape = self.graph.op_with_const("Reshape", "bbox_pred/reshape", bbox_pred.outputs[0], bbox_pred_shape)

    # Scale bbox_pred_reshape to get accurate coordinates.
    scale_adj = np.expand_dims(np.asarray([0.1, 0.1, 0.2, 0.2], dtype=np.float32), axis=(0, 1))
    final_bbox_pred = self.graph.op_with_const("Mul", "bbox_pred/scale", bbox_pred_reshape[0], scale_adj)

    # Reshape node for cls_score, preparing for slicing and second NMS.
    cls_score_shape = np.array([self.batch_size, self.first_NMS_max_proposals, self.num_classes + 1], dtype=np.int64)
    cls_score_reshape = self.graph.op_with_const("Reshape", "cls_score/reshape", cls_score.outputs[0], cls_score_shape)

    # Slice operation to adjust the third dimension of cls_score tensor, removing the background class.
    final_cls_score = self.graph.slice("cls_score/slicer", cls_score_reshape[0], 0, self.num_classes, 2)

    # Create NMS node.
    nms_outputs = self.NMS(
        final_bbox_pred[0], final_cls_score[0], matmul_out[0],
        -1, False, self.second_NMS_max_proposals,
        self.second_NMS_iou_threshold, self.second_NMS_score_threshold,
        second_nms_threshold, 'box_outputs'
    )

    # Create ROIAlign node for keypoint heatmaps, pooling from the second NMS outputs.
    keypoint_pooler_output = self.ROIAlign(
        nms_outputs[1], p2, p3, p4, p5,
        self.second_ROIAlign_pooled_size,
        self.second_ROIAlign_sampling_ratio,
        self.second_ROIAlign_type,
        self.second_NMS_max_proposals,
        'keypoint_pooler'
    )  # [1, 100, 256, 14, 14]
    print(f"keypoint_pooler_output: {keypoint_pooler_output}")

    # Reshape the output from ROIAlign to match the expected input of the convolution layers.
    heatmap_reshape_shape = np.asarray(
        [self.second_NMS_max_proposals, self.fpn_out_channels,
         self.second_ROIAlign_pooled_size, self.second_ROIAlign_pooled_size],
        dtype=np.int64
    )
    heatmap_reshape_node = self.graph.op_with_const("Reshape", "keypoint_pooler/reshape", keypoint_pooler_output, heatmap_reshape_shape)
    print(f"heatmap_reshape_node: {heatmap_reshape_node}")

    key_head_conv = self.graph.find_node_by_op_name("Conv", "/roi_heads/keypoint_head/conv_fcn1/Conv")
    '''
    Inputs: [
        Variable (/roi_heads/keypoint_pooler/ScatterND_3_output_0): (shape=['unk__306', 'unk__307', 'unk__308', 'unk__309'], dtype=float32)
        Constant (model.roi_heads.keypoint_head.conv_fcn1.weight): (shape=[512, 256, 3, 3], dtype=float32)
        Constant (model.roi_heads.keypoint_head.conv_fcn1.bias): (shape=[512], dtype=float32)
    ]
    Outputs: [
        Variable (/roi_heads/keypoint_head/conv_fcn1/Conv_output_0): (shape=['unk__306', 512, 'unk__310', 'unk__311'], dtype=float32)
    ]
    '''

    key_head_conv.inputs[0] = heatmap_reshape_node[0]  # inputs[0] because the weights and biases at [1] and [2] are also considered inputs; the node holds a [Variable], hence [0]

    # Filip: here we need to connect MaskRCNNConvUpsampleHead
    # (conv_fcn8): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    # (conv_fcn_relu8): ReLU()
    # (score_lowres): ConvTranspose2d(512, 17, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))

    last_conv = self.graph.find_node_by_op_name("ConvTranspose", "/roi_heads/keypoint_head/score_lowres/ConvTranspose")
    # print(f"last_conv {last_conv}")
    # Alternatively consider using /roi_heads/keypoint_head/conv_fcn8/Conv  or roi_heads/keypoint_head/conv_fcn_relu8/Relu as intermediate output?

    # Final Reshape node from the mask code; not sure if it is necessary. Should the keypoint code reshape the ReLU output rather than the Sigmoid? May be important for batch != 1 support.

    final_graph_reshape_shape = np.asarray([self.second_NMS_max_proposals, self.num_keypoints, self.key_out_res, self.key_out_res], dtype=np.int64)
    print(f"final_graph_reshape_shape: {final_graph_reshape_shape}")
    final_graph_reshape_node = self.graph.op_with_const("Reshape", "key_head/final_reshape", last_conv.outputs[0], final_graph_reshape_shape)
    final_graph_reshape_node[0].dtype = np.float32
    final_graph_reshape_node[0].name = "detection_keys"

    return nms_outputs, final_graph_reshape_node[0]

I haven't re-implemented the operations below, because I can't understand:

  1. What are they doing, and why do you need them? Why wouldn't the 100 bboxes and 100 masks be ordered the same way from the start (if we treat them the same way and work with the same proposals after ROIAlign)?
  2. Do these operations still make sense if my activation is ReLU, not Sigmoid?
  3. Do I need to specifically reference my ReLU, or is pointing to self.graph.find_node_by_op_name("ConvTranspose", "/roi_heads/keypoint_head/score_lowres/ConvTranspose") enough?
    
# Reshape node that is preparing 2nd NMS class outputs for Add node that comes next.
classes_reshape_shape = np.asarray([self.second_NMS_max_proposals*self.batch_size], dtype=np.int64)
classes_reshape_node = self.graph.op_with_const("Reshape", "box_outputs/reshape_classes", nms_outputs[3], classes_reshape_shape)

# This loop will generate an array used in Add node, which eventually will help Gather node to pick the single
# class of interest per bounding box, instead of creating 80 masks for every single bounding box.
add_array = []
for i in range(self.second_NMS_max_proposals*self.batch_size):
    if i == 0:
        start_pos = 0
    else:
        start_pos = i * self.num_classes
    add_array.append(start_pos)

# This Add node is one of the Gather node inputs; Gather performs a gather on the 0th axis of the data tensor
# and requires indices that are within bounds, which this Add node provides.
add_array = np.asarray(add_array, dtype=np.int32)
classes_add_node = self.graph.op_with_const("Add", "box_outputs/add", classes_reshape_node[0], add_array)

# Get the last Conv op in mask head and reshape it to correctly gather class of interest's masks.
'''
At the end of MaskRCNNConvUpsampleHead, this is the last layer in the network from print(pytorch_model):
(predictor): Conv2d(256, 80, kernel_size=(1, 1), stride=(1, 1))
'''
last_conv = self.graph.find_node_by_op_name("Conv", "/roi_heads/mask_head/predictor/Conv")
last_conv_reshape_shape = np.asarray([self.second_NMS_max_proposals*self.num_classes*self.batch_size, self.mask_out_res, self.mask_out_res], dtype=np.int64)
last_conv_reshape_node = self.graph.op_with_const("Reshape", "mask_head/reshape_all_masks", last_conv.outputs[0], last_conv_reshape_shape)

# Gather node that selects only masks belonging to detected class; the 79 other masks are discarded.
final_gather = self.graph.gather("mask_head/final_gather", last_conv_reshape_node[0], classes_add_node[0], 0)

# Get last Sigmoid node and connect Gather node to it.
mask_head_sigmoid = self.graph.find_node_by_op_name("Sigmoid", "/roi_heads/mask_head/Sigmoid")
mask_head_sigmoid.inputs[0] = final_gather[0]
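To make the Reshape -> Add -> Gather trio above concrete, here is a tiny numpy analogue (toy shapes and names of my own, not the sample's code): flatten (boxes, classes) into one axis, offset each detected class index by box_index * num_classes, then gather one mask per box.

import numpy as np

num_boxes, num_classes, res = 3, 4, 2
masks = np.random.rand(num_boxes * num_classes, res, res).astype(np.float32)  # reshaped conv output
detected = np.array([2, 0, 3])                 # detected class per box (nms_outputs[3])
offsets = np.arange(num_boxes) * num_classes   # the add_array built above
picked = masks[detected + offsets]             # Gather on axis 0
assert picked.shape == (num_boxes, res, res)   # one mask kept per box, the rest discarded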

Huxwell commented 5 months ago

In fact, the heatmaps were in the correct order; the way I modified the visualization script didn't account for the different format of the tensor.

The Gather (https://onnx.ai/onnx/operators/onnx__Gather.html) seems to be unnecessary for keypoints. The snippet above works well, and requires a heatmap->keypoint conversion in postprocessing (https://github.com/facebookresearch/detectron2/blob/main/detectron2/structures/keypoints.py reimplemented in numpy, plus rescaling and repositioning to the bbox) for final predictions.
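For reference, a rough numpy sketch of that heatmap->keypoint step (my simplification: plain argmax per heatmap mapped back into the box; detectron2's heatmaps_to_keypoints additionally upsamples each heatmap to the box size first):

import numpy as np

def heatmaps_to_keypoints_np(heatmaps, boxes):
    # heatmaps: (N, K, H, W) float32; boxes: (N, 4) as (x1, y1, x2, y2)
    n, k, h, w = heatmaps.shape
    keypoints = np.zeros((n, k, 3), dtype=np.float32)
    for i in range(n):
        x1, y1, x2, y2 = boxes[i]
        sx, sy = (x2 - x1) / w, (y2 - y1) / h          # heatmap cell -> pixels
        for j in range(k):
            py, px = np.unravel_index(heatmaps[i, j].argmax(), (h, w))
            keypoints[i, j] = (x1 + (px + 0.5) * sx,   # x in image coordinates
                               y1 + (py + 0.5) * sy,   # y in image coordinates
                               heatmaps[i, j, py, px]) # peak activation as score
    return keypoints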

azhurkevich commented 5 months ago

@Huxwell sounds awesome, it seems like you've figured it out)

ttyio commented 4 months ago

closing since this is already solved!