NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

ERROR: 10: Could not find any implementation for node and ERROR 9: Skipping tactic 0x0000000000000000 due to exception Unexpected type in resetWeightsTypeIfFP TensorRT 9.3 on GPU RTX4090 #3748

Closed: bernardrb closed this issue 5 months ago

bernardrb commented 6 months ago

Description

We are experimenting with mixed precision using different precision formats. In this build, we attempted to cast one "stage" to FP8 and the rest to FP16, using PREFER_PRECISION_CONSTRAINTS. The Google Drive link contains the relevant code, failed build logs, the .onnx model, and polygraphy inspections of the .onnx.
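
For reference, a minimal sketch of a build configuration along these lines with the TensorRT Python API (simplified, not our exact code; the model path is a placeholder):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # placeholder path
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels
config.set_flag(trt.BuilderFlag.FP8)   # allow FP8 kernels
# Prefer (but do not require) the per-layer precisions assigned later.
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
```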

Errors:

[03/28/2024-08:27:45] [TRT] [V] =============== Computing costs for {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]}
[03/28/2024-08:27:45] [TRT] [V] *************** Autotuning format combination: Float(2097152,16384,128,1), Float(2097152,16384,128,1) -> Float(1048576,4096,64,1) ***************
[03/28/2024-08:27:45] [TRT] [V] --------------- Timing Runner: {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023])
[03/28/2024-08:27:45] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception Unexpected type in resetWeightsTypeIfFP
[03/28/2024-08:27:45] [TRT] [V] {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023]) profiling completed in 0.0141819 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[03/28/2024-08:27:45] [TRT] [V] *************** Autotuning format combination: Half(2097152,16384,128,1), Half(2097152,16384,128,1) -> Half(1048576,4096,64,1) ***************
[03/28/2024-08:27:45] [TRT] [V] --------------- Timing Runner: {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023])
[03/28/2024-08:27:45] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception Unexpected type in resetWeightsTypeIfFP
[03/28/2024-08:27:45] [TRT] [V] {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023]) profiling completed in 0.0165441 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[03/28/2024-08:27:45] [TRT] [V] *************** Autotuning format combination: Half(262144,1:8,2048,16), Half(262144,1:8,2048,16) -> Half(131072,1:8,2048,32) ***************
[03/28/2024-08:27:45] [TRT] [V] --------------- Timing Runner: {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023])
[03/28/2024-08:27:45] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception Unexpected type in resetWeightsTypeIfFP
[03/28/2024-08:27:45] [TRT] [V] {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023]) profiling completed in 0.0164769 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[03/28/2024-08:27:45] [TRT] [V] *************** Autotuning format combination: FP8(2097152,16384,128,1), FP8(2097152,16384,128,1) -> FP8(1048576,4096,64,1) ***************
[03/28/2024-08:27:45] [TRT] [V] --------------- Timing Runner: {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023])
[03/28/2024-08:27:45] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception Unexpected type in resetWeightsTypeIfFP
[03/28/2024-08:27:45] [TRT] [V] {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} (Myelin[0x80000023]) profiling completed in 0.0164394 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[03/28/2024-08:27:45] [TRT] [W] No valid obedient candidate choices for node {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]} that meet the preferred precision. The remaining candidate choices will be profiled.
[03/28/2024-08:27:45] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]}.
[03/28/2024-08:27:45] [TRT] [E] 10: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]}.)

From https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#error-messaging,

This error message occurs because there is no layer implementation for the given node in the network that can operate with the given workspace size. This usually occurs because the workspace size is insufficient but could also indicate a bug. If increasing the workspace size as suggested does not help, report a bug (refer to Reporting TensorRT Issues).

But according to issue #2035, the workspace should default to effectively unlimited and does not have to be configured? The issue also seems to be discussed on onnx-tensorrt: https://github.com/onnx/onnx-tensorrt/issues/758
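
For completeness, the workspace limit can still be raised explicitly if needed (a hedged sketch; `config` as in the build sketch above, and the 8 GiB value is arbitrary):

```python
import tensorrt as trt

# Optional: set the workspace pool explicitly. In recent TensorRT versions the
# builder can already use all available device memory by default.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 8 << 30)  # 8 GiB
```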

We cannot find any information on Skipping tactic 0x0000000000000000 due to exception Unexpected type in resetWeightsTypeIfFP, but we imagine it has to do with setting the precision of layers. Are there limitations to using mixed precision, in our case FP8 and FP16? Our precision settings were FP16 (stages.0) -> FP16 (stages.1) -> FP16 (stages.2) -> FP8 (stages.3) -> FP16 (stages.4) -> FP16 (stages.5) -> FP16 (neck). Note that "stages.x" is part of the layer names and is used to differentiate between blocks.
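
A minimal sketch of how such a per-stage assignment can be made by name matching (assumed to resemble our set_mixed_precision helper rather than reproduce it exactly):

```python
import tensorrt as trt

def set_mixed_precision(network: trt.INetworkDefinition) -> None:
    """Prefer FP8 for stages.3 and FP16 for every other layer (name-based)."""
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if "stages.3" in layer.name:
            layer.precision = trt.DataType.FP8
        else:
            layer.precision = trt.DataType.HALF
        # With PREFER_PRECISION_CONSTRAINTS the builder may still fall back
        # when no kernel exists for the requested type.
```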

The troublesome layer, ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add], and the adjacent layers are set to prefer FP8:

2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/conv/Conv set to PREFER FP8 data type
2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Mul set to PREFER FP8 data type
2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Mul_1 set to PREFER FP8 data type
2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_output_0 set to PREFER FP8 data type
2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Mul_2 set to PREFER FP8 data type
2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Add set to PREFER FP8 data type
2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_1_output_0 set to PREFER FP8 data type
2024-03-28 08:27:29.189 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Mul_3 set to PREFER FP8 data type
2024-03-28 08:27:29.190 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Tanh set to PREFER FP8 data type
2024-03-28 08:27:29.190 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_2_output_0 set to PREFER FP8 data type
2024-03-28 08:27:29.190 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Add_1 set to PREFER FP8 data type
2024-03-28 08:27:29.190 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Mul_4 set to PREFER FP8 data type
2024-03-28 08:27:29.190 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 set to PREFER FP8 data type
2024-03-28 08:27:29.190 | INFO     | __main__:set_mixed_precision:237 - Layer /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Mul_5 set to PREFER FP8 data type

There is also a fusion that is run:

[03/28/2024-08:27:29] [TRT] [V] Running: ConstShuffleFusion on /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_output_0
[03/28/2024-08:27:29] [TRT] [V] ConstShuffleFusion: Fusing /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_output_0 with ONNXTRT_Broadcast_95
[03/28/2024-08:27:29] [TRT] [V] Running: ConstShuffleFusion on /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_1_output_0
[03/28/2024-08:27:29] [TRT] [V] ConstShuffleFusion: Fusing /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_1_output_0 with ONNXTRT_Broadcast_97
[03/28/2024-08:27:29] [TRT] [V] Running: ConstShuffleFusion on /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_2_output_0
[03/28/2024-08:27:29] [TRT] [V] ConstShuffleFusion: Fusing /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_2_output_0 with ONNXTRT_Broadcast_99
[03/28/2024-08:27:29] [TRT] [V] Running: ConstShuffleFusion on /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0
[03/28/2024-08:27:29] [TRT] [V] ConstShuffleFusion: Fusing /image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 with ONNXTRT_Broadcast_101

[Image: troublesome_graph]

Here is the original .onnx graph, with the layers we assume are causing the issue.

How can we avoid this issue? Having tried enabling the FP8 flag for the entire network and hitting the same error on a different layer, we assume it has to do with FP8. Is there a way around it?

Looking at the ONNX Mul operator, for example, it might be that FP8 is simply not supported; its type constraints are T in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8) ).
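
The allowed input types of an ONNX operator can also be checked programmatically; a small sketch with the onnx package:

```python
import onnx.defs

schema = onnx.defs.get_schema("Mul")  # latest registered opset by default
for tc in schema.type_constraints:
    print(tc.type_param_str, "->", sorted(tc.allowed_type_strs))
# If 'tensor(float8e4m3fn)' is absent from T, Mul does not accept FP8 tensors.
```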

Also, is there a way to see the complete ForeignNode, as the log seems to elide some of the layers for brevity: ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]?

Warnings:

[03/28/2024-08:27:29] [TRT] [W] Detected layernorm nodes in FP16.
[03/28/2024-08:27:29] [TRT] [V] /image_encoder/norm/ReduceMean, /image_encoder/norm/Sub, /image_encoder/norm/Mul, /image_encoder/norm/ReduceMean_1, /image_encoder/norm/Add, /image_encoder/norm/Sqrt, /image_encoder/norm/Div, /image_encoder/norm/Mul_1, /image_encoder/norm/Add_1
[03/28/2024-08:27:29] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.

Is the best way to avoid this to force the layernorm layers to FP32?
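
For example, a hedged sketch of forcing the layers listed in the warning above to FP32 by name matching (the prefix is taken from that log; the helper name is just illustrative):

```python
import tensorrt as trt

LAYERNORM_PREFIX = "/image_encoder/norm/"  # from the warning above

def force_layernorm_fp32(network: trt.INetworkDefinition) -> None:
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name.startswith(LAYERNORM_PREFIX):
            layer.precision = trt.DataType.FLOAT
            layer.set_output_type(0, trt.DataType.FLOAT)
```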

Other:

We are not using a saved calibration cache, so we assume that we can safely ignore these messages:

[03/27/2024-13:27:14] [TRT] [V] Tactic Name: sm80_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize64x128x32_stage5_warpsize2x2x1_tensor16x8x16 numSplitK: 2 numBuffers: 2 numKernels: 2 Tactic: 0x00000002040400e1 Time: 0.0224846
[03/27/2024-13:27:14] [TRT] [V] Setting a default quantization params because quantization data is missing for /image_encoder/backbone/stages.2/op_list.2/main/point_conv/conv/Conv

Issue #3612 mentions the same message, which leads us to believe it is not a problem when no calibration cache is used.

We see very similar issues when casting the network to BF16:

[03/28/2024-13:05:46] [TRT] [V] =============== Computing costs for {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]}
[03/28/2024-13:05:46] [TRT] [V] *************** Autotuning format combination: BFloat16(3145728,1048576,1024,1) -> BFloat16(1048576,4096,64,1), BFloat16(524288,1024,32,1), BFloat16(3145728,98304,96,1), BFloat16(1048576,32768,32,1) ***************
[03/28/2024-13:05:46] [TRT] [V] --------------- Timing Runner: {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]} (Myelin[0x80000023])
[03/28/2024-13:05:52] [TRT] [V] [MemUsageChange] Subgraph create: CPU +63, GPU +32, now: CPU 2925, GPU 745 (MiB)
[03/28/2024-13:05:53] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:1520] Autotuner: no tactics to implement operation:
 1379: corrltn: /image_encoder/backbone/stages_4/op_list_0/main/depth_conv/conv/Conv_output_0'_before_bias.1-(bf16[1,4096,32,32][]so[], mem_prop=0) | /image_encoder/backbone/stages_4/op_list_0/main/inverted_conv/act/Mul_5_output_0'.1-(bf16[1,4096,64,64][]so[], mem_prop=0), /image_encoder/backbone/stages_4/op_list_0/main/depth_conv/conv/Conv filterWeights-{-0.0
[03/28/2024-13:05:53] [TRT] [V] {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]} (Myelin[0x80000023]) profiling completed in 6.42476 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[03/28/2024-13:05:53] [TRT] [V] *************** Autotuning format combination: BFloat16(1048576,1:8,1024,1) -> BFloat16(131072,1:8,2048,32), BFloat16(65536,1:8,2048,64), BFloat16(393216,1:8,384,4), BFloat16(131072,1:8,128,4) ***************
[03/28/2024-13:05:53] [TRT] [V] --------------- Timing Runner: {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]} (Myelin[0x80000023])
[03/28/2024-13:05:53] [TRT] [V] [MemUsageChange] Subgraph create: CPU +39, GPU +32, now: CPU 2934, GPU 875 (MiB)
[03/28/2024-13:05:53] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:1520] Autotuner: no tactics to implement operation:
 1381: corrltn: /image_encoder/backbone/stages_4/op_list_0/main/depth_conv/conv/Conv_output_0'_before_bias.1-(bf16[1,4096,32,32][]so[], mem_prop=0) | /image_encoder/backbone/stages_4/op_list_0/main/inverted_conv/act/Mul_5_output_0'.1-(bf16[1,4096,64,64][]so[], mem_prop=0), /image_encoder/backbone/stages_4/op_list_0/main/depth_conv/conv/Conv filterWeights-{-0.0
[03/28/2024-13:05:53] [TRT] [V] {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]} (Myelin[0x80000023]) profiling completed in 0.720026 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[03/28/2024-13:05:53] [TRT] [W] No valid obedient candidate choices for node {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]} that meet the preferred precision. The remaining candidate choices will be profiled.
[03/28/2024-13:05:53] [TRT] [V] *************** Autotuning format combination: Float(3145728,1048576,1024,1) -> Float(1048576,4096,64,1), Float(524288,1024,32,1), Float(3145728,98304,96,1), Float(1048576,32768,32,1) ***************
[03/28/2024-13:05:53] [TRT] [V] --------------- Timing Runner: {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]} (Myelin[0x80000023])
[03/28/2024-13:05:54] [TRT] [V] [MemUsageChange] Subgraph create: CPU +40, GPU +32, now: CPU 2936, GPU 1005 (MiB)
[03/28/2024-13:05:54] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:1520] Autotuner: no tactics to implement operation:
 1975: corrltn: /image_encoder/backbone/stages_4/op_list_0/main/depth_conv/conv/Conv_output_0'_before_bias.1-(bf16[1,4096,32,32][]so[], mem_prop=0) | /image_encoder/backbone/stages_4/op_list_0/main/inverted_conv/act/Mul_5_output_0'.1-(bf16[1,4096,64,64][]so[], mem_prop=0), /image_encoder/backbone/stages_4/op_list_0/main/depth_conv/conv/Conv filterWeights-{-0.0
[03/28/2024-13:05:54] [TRT] [V] {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]} (Myelin[0x80000023]) profiling completed in 0.968934 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[03/28/2024-13:05:54] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]}.
[03/28/2024-13:05:54] [TRT] [E] 10: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/image_encoder/backbone/stages.0/op_list.0/conv/Conv.../image_encoder/backbone/stages.4/op_list.1/context_module/main/Slice_2]}.)

Environment

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 40%   31C    P8              5W /  450W |      11MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

TensorRT Version: 9.3.0.post11.dev1

Operating System: Ubuntu 22.04

Baremetal or Container (if so, version): TensorRT-24.02-py

Relevant Files

Code, logs, and .onnx model

https://drive.google.com/drive/folders/1MJAP7NDO7zzRJlUJFexpTcxKVWT9tnuP?usp=sharing

lix19937 commented 6 months ago

You can disable BF16 and FP8 and try again. Maybe some nodes have no implementation for those types in the current version of TRT.

zerollzeng commented 6 months ago

We just released TRT 10 EA; could you please try it first?

nvpohanh commented 6 months ago

Just a side note: we do not support FP8 for Convolutions yet

zerollzeng commented 6 months ago

@nvpohanh Currently (TRT 10 EA) we only support FP8 for Q/DQ MatMul, right? Do we support FP8 MHA?

nvpohanh commented 6 months ago

FP8 MHA is supported only for SeqLen<=512

bernardrb commented 6 months ago

@nvpohanh

Where can we find information about layer support?

Searching the docs, we can only see that the Dequantize layer supports FP8.

https://docs.nvidia.com/deeplearning/tensorrt/operators/docs/Dequantize.html

Furthermore, for clarity's sake: setting precisions with the TensorRT API (i.e. self.config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS), layer.precision = tensorrt.DataType, and layer.set_output_type(index, dtype)) allows controlling the layer-wise computation and output precision. Does that mean pytorch-quantization is not required to achieve explicit quantization?
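
A short sketch of the API usage in question (hedged; setting precision constraints only influences kernel selection and does not insert Q/DQ nodes, which is what explicit quantization via pytorch-quantization produces):

```python
import tensorrt as trt

# `config` and `network` assumed from the build sketch earlier in this issue.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)  # fail instead of falling back

layer = network.get_layer(0)                  # arbitrary example layer
layer.precision = trt.DataType.HALF           # requested compute precision
layer.set_output_type(0, trt.DataType.HALF)   # requested output tensor type
```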

zerollzeng commented 6 months ago

@nvpohanh I think we should document FP8 layer support in our developer guide.