You can disable BF16 and FP8 and try it. Maybe some nodes have no implementation for those types in the current version of TRT.
We just released TRT 10 EA; could you please try that first?
Just a side note: we do not support FP8 for Convolutions yet
@nvpohanh Currently (TRT 10 EA) we only support FP8 for Q/DQ MatMul, right? Do we support FP8 MHA?
FP8 MHA is supported only for SeqLen<=512
@nvpohanh
Where can we find information about layer support?
Searching the docs, we can only see that the Dequantize layer supports FP8:
https://docs.nvidia.com/deeplearning/tensorrt/operators/docs/Dequantize.html
Furthermore, for clarity's sake: does setting precisions through the TensorRT API (i.e. `self.config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)`, `layer.precision = tensorrt.tensorrt.DataType`, and `layer.set_output_type(self: tensorrt.tensorrt.ILayer, index: int, dtype: tensorrt.tensorrt.DataType)`) allow controlling the layer-wise computational and output precision, thereby not requiring pytorch-quantization to achieve explicit quantization?
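For reference, a minimal sketch of those calls in context (assuming a `network` populated by the ONNX parser; the FP16/HALF choices here are purely illustrative):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Enable the reduced-precision formats, then ask the builder to honor (OBEY)
# or prefer (PREFER) the per-layer constraints set below.
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.FP8)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# ... populate `network`, e.g. via trt.OnnxParser, then constrain a layer:
layer = network.get_layer(0)                 # illustrative: first layer only
layer.precision = trt.DataType.HALF          # requested compute precision
layer.set_output_type(0, trt.DataType.HALF)  # requested output precision
```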
@nvpohanh I think we should add FP8 layer support information to our developer guide.
Description
We are trying to experiment with mixed precision using different precision formats. In this build, we were attempting to cast one "stage" to FP8 and the rest to FP16, using PREFER_PRECISION_CONSTRAINTS. The Google Drive link contains the relevant code, failed build logs, the .onnx model, and Polygraphy inspections of the .onnx.
Errors:
From https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#error-messaging,
But according to issue #2035, the workspace should default to unlimited and should not have to be configured? The same issue also seems to be discussed on onnx-tensorrt: https://github.com/onnx/onnx-tensorrt/issues/758
We cannot find any information on `Skipping tactic 0x0000000000000000 due to exception Unexpected type in resetWeightsTypeIfFP`, but we imagine it has to do with setting the precision of layers. Are there limitations to using mixed precision, in our case FP8 and FP16? Our precision settings were FP16 (stages.0) -> FP16 (stages.1) -> FP16 (stages.2) -> FP8 (stages.3) -> FP16 (stages.4) -> FP16 (stages.5) -> FP16 (neck). Note that the "stages.x" substrings are part of the layer names and differentiate the different blocks.
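For illustration, that assignment can be expressed as a name-based loop like the sketch below (a sketch only; `network` is the parsed INetworkDefinition, and the `"stages.3"` substring match mirrors the layer naming from the ONNX graph):

```python
import tensorrt as trt

def set_stage_precisions(network: trt.INetworkDefinition) -> None:
    """Sketch: request FP8 for layers in stages.3, FP16 everywhere else."""
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        dtype = trt.DataType.FP8 if "stages.3" in layer.name else trt.DataType.HALF
        layer.precision = dtype
        for out in range(layer.num_outputs):
            layer.set_output_type(out, dtype)
```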
The troublesome layer is `ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]`, and the adjacent layers are set to prefer FP8:
There is also a fusion that is run:
Here is the original .onnx graph, with the layers we assume are causing the issue.
How can we avoid this issue? Having tried enabling the FP8 flag for the entire network and hitting the same error on a different layer, we assume it is related to FP8.
Looking at the ONNX Mul operator, for example, it might be that FP8 is not supported: its type constraint T only includes tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), and tensor(uint8).
Also, is there a way to see the complete ForeignNode? It seems to hide some of the layers for brevity: `ForeignNode[/image_encoder/backbone/stages.3/op_list.4/main/spatial_conv/act/Constant_3_output_0 + ONNXTRT_Broadcast_101.../image_encoder/backbone/stages.3/op_list.4/Add]`
Warnings:
Is the best way to avoid this to set the precision of LayerNorm layers to FP32?
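If so, a sketch of what we have in mind is below (unverified on this model; the ONNX parser may not map every LayerNorm onto an INormalizationLayer, so matching by layer name might be needed instead):

```python
import tensorrt as trt

def pin_norm_layers_to_fp32(network: trt.INetworkDefinition) -> None:
    """Sketch: keep normalization layers in FP32 to avoid reduced-precision overflow warnings."""
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type == trt.LayerType.NORMALIZATION:
            layer.precision = trt.DataType.FLOAT
            for out in range(layer.num_outputs):
                layer.set_output_type(out, trt.DataType.FLOAT)
```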
Other:
We are not using a saved calibration cache, so we assume that we can safely ignore these messages:
Issue #3612 mentions the same problem, which leads us to believe that it is not an issue when you don't have a cache.
We are having very similar issues when casting the network to BF16:
Environment
TensorRT Version: 9.3.0.post11.dev1
Operating System: Ubuntu 22.04
Baremetal or Container (if so, version): TensorRT-24.02-py
Relevant Files
Code, logs, and .onnx model
https://drive.google.com/drive/folders/1MJAP7NDO7zzRJlUJFexpTcxKVWT9tnuP?usp=sharing