Closed: FrancescoSaverioZuppichini closed this issue 1 year ago
8 weights are affected by this issue: Detected subnormal FP16 values.
Can you remove the half() in model = convnext_small(ConvNeXt_Small_Weights.IMAGENET1K_V1).eval().half().cuda()? This warning means your model contains weights that fall outside the normal FP16 range (e.g. greater than 65504).
Your ONNX model has been generated with INT64 weights. Most ONNX models have INT64 weights, while TRT only supports INT32 weights. Usually this is not a problem because your weights are unlikely to fall outside the INT32 range (-2,147,483,648 to +2,147,483,647), so this warning can be ignored.
Hi @zerollzeng thanks for the reply. I would like to keep my model in half-precision mode, any advice on how to achieve it while removing the error?
You can retrain your model in FP16 so that all the weights fall within the FP16 range, or apply some normalization to the subnormal weights.
Retraining is not possible. Is there a way to serve a mixed-precision model with TensorRT? My task is very simple: I have a model in PyTorch, I exported it in mixed precision, and I want to run it with TensorRT, nothing really crazy. I am just evaluating the technology.
If I remove the .half(), will the model still be in mixed precision? I'm not sure whether torch.autocast will make sure most of it stays in float16.
One thing that I don't understand is where the INT64 weights are coming from, any idea?
Thanks a lot :)
hey @zerollzeng I am trying everything here. See https://github.com/microsoft/onnxconverter-common/issues/251 and https://github.com/microsoft/onnxconverter-common/issues/252, but they are not TensorRT related. Have you ever created a mixed-precision model in ONNX?
Yes, mixed precision is feasible: just export the ONNX model in FP32, and when you build the engine you can enable FP16 (you don't need to export an FP16 model in the first place) and pin the subnormal layers to FP32.
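For the export step, a rough sketch (untested; file name, input shape, and opset are just placeholders):

```python
# Hedged sketch: export the model in plain FP32 (no .half()), and let
# TensorRT handle the FP16 conversion at engine-build time.
import torch
from torchvision.models import convnext_small, ConvNeXt_Small_Weights

model = convnext_small(weights=ConvNeXt_Small_Weights.IMAGENET1K_V1).eval().cuda()
dummy = torch.randn(1, 3, 224, 224, device="cuda")  # placeholder input shape

torch.onnx.export(
    model,
    dummy,
    "convnext_small_fp32.onnx",  # placeholder file name
    input_names=["input"],
    output_names=["output"],
    opset_version=16,  # assumption: pick the opset your torch/TRT versions support
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```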
@zerollzeng how do I do it? I used the functions from the ONNX docs but they are not working. I'd like to have an automatic way to find the subnormal layers. If you have experience with it, do you mind sharing a snippet?
You can refer to https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#layer-level-control. API: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_layer.html
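As a rough illustration of that layer-level control with the Python API (untested sketch; the layer names to pin and the file names are placeholders, only the general pattern matters):

```python
# Hedged sketch: build an FP16 engine from an FP32 ONNX model, but force a
# few chosen layers (e.g. the ones reported as subnormal) to stay in FP32.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("convnext_small_fp32.onnx", "rb") as f:  # placeholder file name
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make TRT respect the per-layer precisions set below.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

keep_fp32 = {"some_layer_name"}  # hypothetical: layers you want to keep in FP32
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.name in keep_fp32:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```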
If you are using trtexec, you can specify per-layer precision with --layerPrecisions and --precisionConstraints. See trtexec -h.
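For example, something like this (the layer name is just a placeholder, and please double-check the exact spec syntax in trtexec -h): trtexec --onnx=model.onnx --fp16 --precisionConstraints=obey --layerPrecisions=some_layer_name:fp32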
@zerollzeng I think I didn't explain myself well. I know this is now ONNX related rather than TensorRT related, so somewhat off topic.
Still, nobody online seems to have exported a PyTorch model to mixed precision in ONNX and documented the process, so I was wondering if you know how to do it :)
We don't currently support building engines that follow the precision specified in the ONNX model. It may be on our roadmap, but I'm not sure. @kevinch-nv do you know about it?
For now I think you can only mark the per-layer precision manually with the TRT API. It should be possible to write some automated code that parses the original ONNX precision and marks the layers in TensorRT accordingly, but I don't have such a script at hand :-(.
@zerollzeng thanks for the reply :) Sorry for my noobiness, but how does it work internally? Assuming I have an ONNX graph with mixed-precision nodes (float32 and float16), should that be enough for TensorRT?
Yes, that's enough. For example, you can use https://github.com/NVIDIA/TensorRT/tree/master/tools/onnx-graphsurgeon: write a Python script that reads the ONNX model and iterates over all operators, and when an operator's precision is FP16, set the corresponding TRT layer to FP16. Then enable both FP32 and FP16 precision, and you will end up with a mixed-precision TRT engine.
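A rough sketch of the ONNX-reading half of that script (untested; the file name is a placeholder, and matching node names to TRT layer names is only a heuristic):

```python
# Hedged sketch: collect the ONNX nodes that already carry FP16 weights, so the
# corresponding TRT layers can later be set to FP16 via layer.precision.
import onnx

model = onnx.load("model_mixed.onnx")  # placeholder file name

# Names of initializers (weights) stored as FP16.
fp16_initializers = {
    init.name
    for init in model.graph.initializer
    if init.data_type == onnx.TensorProto.FLOAT16
}

# Nodes that consume at least one FP16 initializer.
fp16_node_names = [
    node.name
    for node in model.graph.node
    if any(inp in fp16_initializers for inp in node.input)
]

print(fp16_node_names)
# These names can then be matched (heuristically) against network.get_layer(i).name
# in a TensorRT builder script, setting those layers to trt.float16.
```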
Thanks @zerollzeng, so the current TensorRT limitation is that it cannot honor ONNX models containing both FP16 and FP32 nodes, and I have to export the model in FP32 and manually remap the layers that should be FP16 once I've loaded the model in TensorRT?
In your experience, is it better to run everything in FP16, or does mixed precision actually have an advantage? It looks super hard to achieve in real life.
I would suggest exporting the ONNX model in FP32 and running everything in FP16 with TensorRT first; it will give you better performance, and most likely the accuracy will only degrade a little (it may even get better 🥇). If the accuracy is not good (e.g. some layer outputs or weights exceed the FP16 range and you see a warning in the log), then you can offload those layers back to FP32, which will give you a good balance between performance and accuracy.
You can also try INT8 if you want the best performance. We support QAT and PTQ for INT8 quantization.
but you can't do INT8 on a GPU, or can you?
GPUs with compute capability >= 6.1 support INT8. A quick way to check is to use trtexec: trtexec --onnx=model.onnx --int8, or trtexec --onnx=model.onnx --int8 --fp16 (this enables both INT8 and FP16 and thus provides the best performance).
damn, really? I didn't know that!
how did you get the onnx file?
hey @hannah12356 it's generated from the code in the first message of this issue :)
I'm having a similar issue with YOLOv5! I converted yolov5s from '.pt' to '.engine' (FP32), and their inference results are inconsistent.
https://github.com/ultralytics/yolov5/issues/11255#issuecomment-1486282655 @zerollzeng @FrancescoSaverioZuppichini
I'm closing this since there is no action left.
@wangjl1993 I just saw someone is helping you in the yolov5 repo. If you find TRT accuracy issues, please file a new issue against TensorRT OSS. Thanks!
FYI: we can't say the accuracy is bad until we test on a validation dataset; it might just be fluctuation on some images.
@zerollzeng why was this issue closed?
@FrancescoSaverioZuppichini please reopen it if you have any further questions :)
Description
Hello There!
I hope you are all doing well :)
There are other similar issues, but not even one of them has a fix for this problem.
TensorRT takes a lot of time casting INT64 to INT32, making it impossible to use in real life.
my conversion code
The code takes around 3-4 minutes to convert the weights, and then it outputs the following
Environment
I am using the latest nvidia container
TensorRT Version: 8.5.1
ONNX-TensorRT Version / Branch:
GPU Type: RTX 3090
Nvidia Driver Version:
CUDA Version: 11.7
CUDNN Version:
Operating System + Version:
Python Version (if applicable): 3.8
TensorFlow + TF2ONNX Version (if applicable): 1.13
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
my onnx model (a ConvNeXt)
link to drive
Steps To Reproduce
Copy this code