DerryHub / BEVFormer_tensorrt

BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
Apache License 2.0

onnx2trt with nv_half and nv_half2 failed #37

Open WYYAHYT opened 1 year ago

WYYAHYT commented 1 year ago

Command: python tools/bevformer/onnx2trt.py configs/bevformer/plugin/bevformer_tiny_trt_p.py checkpoints/onnx/bevformer_tiny_epoch_24_cp.onnx

Error (Catch by faulthandler):

[02/24/2023-11:15:11] [TRT] [V] Searching for input: onnx::Expand_1007
[02/24/2023-11:15:11] [TRT] [V] Searching for input: onnx::Expand_1008
[02/24/2023-11:15:11] [TRT] [V] node_of_onnx::Expand_1009 [Expand] inputs: [onnx::Expand_1007 -> ()[INT32]], [onnx::Expand_1008 -> (0)[INT32]], 
[02/24/2023-11:15:11] [TRT] [V] Registering layer: onnx::Expand_1007 for ONNX node: onnx::Expand_1007
Fatal Python error: Segmentation fault

Current thread 0x0000ffffa6e5a9c0 (most recent call first):
  File "/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/convert/onnx2tensorrt.py", line 38 in build_engine
  File "tools/bevformer/onnx2trt.py", line 259 in main
  File "tools/bevformer/onnx2trt.py", line 271 in <module>
Segmentation fault (core dumped)

And File "/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/convert/onnx2tensorrt.py", line 38 in build_engine is the code that parses the ONNX model. I then regenerated the ONNX model (successfully) and tried to convert it to TensorRT again, but it failed again.
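
For reference, a typical TensorRT ONNX parse-and-build flow looks roughly like the sketch below (an assumption about the general pattern, not the repo's exact build_engine code); the segfault above is raised while the parser walks the ONNX graph:

    # Minimal sketch of a TensorRT ONNX build flow (assumed; not the repo's exact code).
    # The repo loads its custom plugin library (libtensorrt_ops.so) before parsing,
    # which is omitted here. The crash above happens around parser.parse().
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

    def build_engine_sketch(onnx_path, fp16=False):
        builder = trt.Builder(TRT_LOGGER)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, TRT_LOGGER)
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):  # the segfault is reported at this stage
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        config = builder.create_builder_config()
        if fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        return builder.build_serialized_network(network, config)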

Maybe the ONNX model is not correct? So I checked it with onnx.checker.check_model(onnx_model), and it is indeed incorrect.
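
A minimal sketch of that check (the ONNX path is taken from the command above):

    # Minimal sketch of the ONNX validity check (path assumed from the command above).
    import onnx

    onnx_model = onnx.load("checkpoints/onnx/bevformer_tiny_epoch_24_cp.onnx")
    try:
        onnx.checker.check_model(onnx_model)
        print("The model is valid")
    except onnx.checker.ValidationError as e:
        print("The model is invalid:", e)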

But there were no errors (only warnings) during pth2onnx with nv_half or nv_half2. Warnings below, from the command: python tools/pth2onnx.py configs/bevformer/plugin/bevformer_tiny_trt_p.py checkpoints/pytorch/bevformer_tiny_epoch_24.pth --opset_version 13 --cuda --flag cp

Loaded tensorrt plugins from /data/projects/bevformer_tensorrt/BEVFormer_tensorrt/TensorRT/lib/libtensorrt_ops.so
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./third_party/bevformer/models/detectors/mvx_two_stage.py:86: UserWarning: DeprecationWarning: pretrained is a deprecated key, please consider using init_cfg
  warnings.warn(
load checkpoint from local path: checkpoints/pytorch/bevformer_tiny_epoch_24.pth
/data/anaconda3/envs/bevformer_tensorrt/lib/python3.8/site-packages/torch/onnx/utils.py:294: UserWarning: `add_node_names' can be set to True only when 'operator_export_type' is `ONNX`. Since 'operator_export_type' is not set to 'ONNX', `add_node_names` argument will be ignored.
  warnings.warn("`{}' can be set to True only when 'operator_export_type' is "
/data/anaconda3/envs/bevformer_tensorrt/lib/python3.8/site-packages/torch/nn/modules/module.py:1402: UserWarning: positional arguments and argument "destination" are deprecated. nn.Module.state_dict will not accept them in the future. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/detector/bevformer.py:34: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  img_feats_reshaped.append(img_feat.view(B, int(BN / B), C, H, W))
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/detector/bevformer.py:40: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
  assert len(img_feats[0]) == 1
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/functions/rotate.py:15: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than tensor.new_tensor(sourceTensor).
  cx = center[0] - center[0].new_tensor(ow * 0.5)
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/functions/rotate.py:16: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than tensor.new_tensor(sourceTensor).
  cy = center[1] - center[1].new_tensor(oh * 0.5)
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/transformer.py:320: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  spatial_shapes[lvl, 0] = int(h)
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/transformer.py:321: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  spatial_shapes[lvl, 1] = int(w)
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/encoder.py:199: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  reference_points = reference_points * torch.tensor(
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/encoder.py:207: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  ).view(1, 1, 1, 3) + torch.tensor(
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/encoder.py:226: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  int(reference_points_cam.shape[3]),
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/encoder.py:600: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  spatial_shapes=torch.tensor([[bev_h, bev_w]], device=query.device),
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/encoder.py:601: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  level_start_index=torch.tensor([0], device=query.device),
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/temporal_self_attention.py:409: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert (spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum() == value.shape[1]
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/temporal_self_attention.py:461: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if reference_points.shape[-1] == 2:
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/spatial_cross_attention.py:256: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  self.num_cams, -1, int(reference_points_cam.size(3)), 2
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/spatial_cross_attention.py:751: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert (spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum() == value.shape[1]
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/spatial_cross_attention.py:767: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if reference_points.shape[-1] == 2:
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/transformer.py:392: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  spatial_shapes=torch.tensor([[bev_h, bev_w]], device=query.device),
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/transformer.py:393: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  level_start_index=torch.tensor([0], device=query.device),
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/decoder.py:443: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert (spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum() == value.shape[1]
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/decoder.py:460: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if reference_points.shape[-1] == 2:
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/modules/decoder.py:96: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert reference_points.shape[-1] == 3
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/dense_heads/bevformer_head.py:249: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  for lvl in range(int(hs.shape[0])):
/data/projects/bevformer_tensorrt/BEVFormer_tensorrt/./det2trt/models/dense_heads/bevformer_head.py:258: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert reference.shape[-1] == 3
WARNING: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
[... the above warning repeated 13 times ...]
[W shape_type_inference.cpp:436] Warning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select) (function ComputeConstantFolding)
WARNING: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
[... the above warning repeated 13 times ...]
[W shape_type_inference.cpp:436] Warning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select) (function ComputeConstantFolding)
WARNING: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
[... the above warning repeated 13 times ...]
ONNX file has been saved in checkpoints/onnx/nv_half/bevformer_tiny_epoch_24_cp.onnx

And here is the generated ONNX model.

Really hope someone can help!

WYYAHYT commented 1 year ago

More detailed results from onnx.checker.check_model:

The model is invalid: No Op registered for RotateTRT with domain_version of 13

==> Context: Bad node spec for node. Name:  OpType: RotateTRT

and

The model is invalid: No Op registered for ModulatedDeformableConv2dTRT with domain_version of 13

==> Context: Bad node spec for node. Name:  OpType: ModulatedDeformableConv2dTRT

It seems to be an ONNX operator error; should I add support for these operators?
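
For reference, a minimal sketch (path assumed from the command above) that lists the op types in the graph that have no standard ONNX schema; RotateTRT and ModulatedDeformableConv2dTRT are presumably this repo's custom TensorRT plugin ops, which the standard ONNX checker cannot know about:

    # Hedged sketch: list op types in the exported graph with no registered ONNX schema
    # (these should be the custom TensorRT plugin ops such as RotateTRT).
    import onnx
    import onnx.defs

    model = onnx.load("checkpoints/onnx/bevformer_tiny_epoch_24_cp.onnx")  # path assumed
    custom_ops = set()
    for node in model.graph.node:
        try:
            onnx.defs.get_schema(node.op_type)
        except Exception:
            custom_ops.add(node.op_type)
    print("Ops without a registered ONNX schema:", sorted(custom_ops))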

DerryHub commented 1 year ago

A segmentation fault is typically caused by a program trying to read from or write to an illegal memory location, i.e. a part of memory the program is not supposed to access. Please check your memory usage while running onnx2trt. This issue has nothing to do with the following:

More detailed results from onnx.checker.check_model:

The model is invalid: No Op registered for RotateTRT with domain_version of 13

==> Context: Bad node spec for node. Name:  OpType: RotateTRT

and

The model is invalid: No Op registered for ModulatedDeformableConv2dTRT with domain_version of 13

==> Context: Bad node spec for node. Name:  OpType: ModulatedDeformableConv2dTRT

It seems to be an ONNX operator error; should I add support for these operators?
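
If you want to follow the memory-usage suggestion above, a minimal sketch for watching host RAM and GPU memory while onnx2trt runs (psutil and pynvml are assumed to be installed; they are not part of this repo):

    # Minimal sketch: poll host RAM and GPU memory while onnx2trt is running.
    # Assumes the psutil and pynvml packages are installed (not part of this repo).
    import time
    import psutil
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    while True:
        ram = psutil.virtual_memory()
        gpu = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(
            f"host RAM: {ram.used / 2**30:.1f}/{ram.total / 2**30:.1f} GiB, "
            f"GPU mem: {gpu.used / 2**30:.1f}/{gpu.total / 2**30:.1f} GiB"
        )
        time.sleep(1)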

serwansj commented 1 year ago

We are having the same issue; have you found a solution, @WYYAHYT?

matcosta23 commented 1 year ago

I'm also having the same issue as @WYYAHYT and @serwansj. I got the following error when trying to optimize the ONNX graph produced with the custom plugins using the onnxsim tool:

onnx.onnx_cpp2py_export.checker.ValidationError: No Op registered for RotateTRT with domain_version of 13

==> Context: Bad node spec for node. Name: RotateTRT_263 OpType: RotateTRT

Have you found a solution for this?

WYYAHYT commented 1 year ago

Sorry to say that I don't have enough time to solve this problem, and I really hope that you guys can solve it. @serwansj @matcosta23

sun-lingyu commented 1 year ago

I ran into a similar problem (segmentation fault) when running sh samples/bevformer/plugin/small/onnx2trt_fp16_2.sh -d 0. Any help is appreciated!

sun-lingyu commented 1 year ago

My problem was caused by using TensorRT 8.4.1 instead of the recommended TensorRT 8.5 version. In NVIDIA's official TensorRT release notes, you can find that 8.4.1 has a known issue that was fixed in TensorRT 8.4.3: "When parsing networks with ONNX operand Expand on scalar input, TensorRT would error out. This issue has been fixed in this release."

I believe that with TensorRT 8.5.1.7, as recommended by the author of this repo, the command would run normally.

However, as I am currently using a Jetson Orin, which only has a TensorRT 8.5.2 package (which does not have this "Expand operator" error but has other errors), I cannot try 8.5.1.7.

Anyway, I believe this error is caused by an incompatible TensorRT version. You can give it a try.
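
A quick way to confirm which TensorRT version the Python environment actually uses:

    # Print the TensorRT version visible to Python (expect 8.5.1.7 per the repo's docs).
    import tensorrt as trt
    print(trt.__version__)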


SimeonZhang commented 7 months ago

The same issue as @WYYAHYT, @serwansj, @matcosta23 and @sun-lingyu: I'm using a Drive Orin Devkit with TensorRT 8.4.11, while with 8.5.3 on an x86 machine this error goes away. I believe it's caused by an incompatible version, but I cannot update the packages for now. Any suggestions for working around this problem?

SimeonZhang commented 7 months ago

Just as @sun-lingyu said, it's caused by the known issue: "When parsing networks with ONNX operand Expand on scalar input, TensorRT would error out. This issue has been fixed in this release." To be precise, it's triggered by this line: https://github.com/DerryHub/BEVFormer_tensorrt/blob/303d3140c14016047c07f9db73312af364f0dd7c/det2trt/models/modules/transformer.py#L313

So I modified the code to the TRT counterpart as follows:

        # Flatten the multi-level camera features and add camera/level embeddings.
        feat_flatten = []
        spatial_shapes = []
        for lvl, feat in enumerate(mlvl_feats):
            bs, num_cam, c, h, w = feat.shape
            spatial_shape = (h, w)
            feat = feat.flatten(3).permute(1, 0, 3, 2)
            if self.use_cams_embeds:
                feat = feat + self.cams_embeds[:, None, None, :].to(feat.dtype)
            feat = feat + self.level_embeds[None,
                                            None, lvl:lvl + 1, :].to(feat.dtype)
            spatial_shapes.append(spatial_shape)
            feat_flatten.append(feat)

        # Build spatial_shapes from plain Python ints via torch.as_tensor; it is traced
        # as a constant, which (per the discussion above) avoids exporting the scalar
        # Expand pattern that older TensorRT versions fail to parse.
        spatial_shapes = torch.as_tensor(
            spatial_shapes, dtype=torch.long, device=bev_pos.device
        )
        level_start_index = torch.cat(
            (spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1])
        )

        for i in range(len(feat_flatten)):
            feat_flatten[i] = feat_flatten[i].permute(0, 2, 1, 3)
        feat_flatten = torch.stack(feat_flatten)

        bev_embed = self.encoder.forward_trt(
            bev_queries,
            feat_flatten,
            feat_flatten,
            lidar2img=lidar2img,
            bev_h=bev_h,
            bev_w=bev_w,
            bev_pos=bev_pos,
            spatial_shapes=spatial_shapes,
            level_start_index=level_start_index,
            prev_bev=prev_bev,
            shift=shift,
            image_shape=image_shape,
            use_prev_bev=use_prev_bev,
        )

        return bev_embed

It works for me. FYI @serwansj @matcosta23 @WYYAHYT
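
To double-check that a re-exported model no longer feeds a scalar into Expand (the pattern the older TensorRT parsers error out on), a hedged sketch that scans the graph for such nodes:

    # Hedged sketch: flag Expand nodes whose data input is known to be a 0-d (scalar)
    # tensor. The ONNX path is assumed; values whose shape cannot be inferred are skipped.
    import onnx
    from onnx import shape_inference

    model = shape_inference.infer_shapes(onnx.load("bevformer_tiny_epoch_24_cp.onnx"))

    ranks = {}
    for vi in list(model.graph.value_info) + list(model.graph.input) + list(model.graph.output):
        if vi.type.tensor_type.HasField("shape"):
            ranks[vi.name] = len(vi.type.tensor_type.shape.dim)
    for init in model.graph.initializer:
        ranks[init.name] = len(init.dims)

    for node in model.graph.node:
        if node.op_type == "Expand" and ranks.get(node.input[0]) == 0:
            print("Expand with scalar data input:", node.name or node.input[0])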