Open AmazDeng opened 3 days ago
@rajeevsrao @ttyio @pranavm-nvidia @aaronp24 @ilyasher Could you please take a look at this issue?
The problem is that `trtexec` will use random scaling factors for int8 mode. If you replace `--best` with `--fp16` (i.e. disable `--int8`), that should improve the accuracy.
@pranavm-nvidia
Thanks for your reply.
I recompiled the engine using the code below, but the inference results from the TensorRT engine are still different from those of Hugging Face. I configured FP16 and did not specify INT8. In this case, INT8 should be disabled. So why are the results still different?
MODEL_NAME="InternVL2-40B"
OUTPUT_MODEL_NAME="InternVL2_40B"
onnx_process_version="onnx_v1"
max_batch_size=24
onnx_dtype="float16"
trt_dtype="fp16"
/usr/src/tensorrt/bin/trtexec \
--onnx=/data/eas/visual_engine/a100/InternViT-6B-448px-V1-5/onnx/visual_encoder.onnx \
--saveEngine=/data/eas/visual_engine/a100/InternViT-6B-448px-V1-5/visual_encoder.trtexec.${trt_dtype}.maxBatchSize${max_batch_size}.engine \
--minShapes=input:1x3x448x448 \
--optShapes=input:8x3x448x448 \
--maxShapes=input:24x3x448x448 \
--fp16
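To localize where the fp16 engine diverges from the ONNX model, Polygraphy's `run` tool can compare TensorRT against ONNX Runtime on the same random inputs. This is a sketch, not from the original post; the tolerances are placeholder values, and the path mirrors the trtexec command above:

```shell
# Compare a TensorRT fp16 build against ONNX Runtime outputs.
# --atol/--rtol are illustrative tolerances, not recommended values.
polygraphy run /data/eas/visual_engine/a100/InternViT-6B-448px-V1-5/onnx/visual_encoder.onnx \
    --trt --fp16 \
    --onnxrt \
    --atol 0.01 --rtol 0.01 \
    --trt-min-shapes input:[1,3,448,448] \
    --trt-opt-shapes input:[8,3,448,448] \
    --trt-max-shapes input:[24,3,448,448]
```

If the comparison fails, `polygraphy debug precision` or per-layer output comparison can then narrow the mismatch down to specific layers.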
hf+float16
outputs.last_hidden_state=tensor([[[ 1.0576, -4.4062, 1.1816, ..., 0.4963, 0.5752, 0.4436],
[ 3.6680, 4.8086, 4.7578, ..., -14.2969, 6.4336, -12.0312],
[ 3.9355, 4.4805, 4.4922, ..., -14.7031, 6.2812, -11.2266],
...,
[ -2.5684, -2.8164, 5.3242, ..., -7.0508, 0.2556, -6.5859],
[ -6.5156, -6.5859, 9.9531, ..., -4.0938, -4.5703, -14.6719],
[ -6.2383, -6.2930, 10.0391, ..., -3.8965, -4.2891, -15.1016]]],
device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)
outputs.last_hidden_state.shape=torch.Size([1, 1025, 3200])
tensorrt fp16
outputs_trt.shape=torch.Size([1025, 3200])
outputs_trt=tensor([[ 1.6475, -3.5586, 2.3145, ..., 0.0755, 0.9883, -0.1611],
[ 4.3867, 3.2012, 5.6523, ..., -10.9531, 6.0000, -10.7500],
[ 5.1836, 3.0391, 5.4883, ..., -11.1641, 6.1055, -9.7188],
...,
[ -0.2261, -3.4922, 5.6211, ..., -4.0312, 0.1794, -5.6328],
[ -3.2559, -5.6719, 8.9219, ..., -4.8320, -3.1484, -11.0000],
[ -3.1094, -5.3867, 9.0156, ..., -4.7812, -3.0059, -11.5312]],
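The gap between the two dumps above can be quantified with a max-absolute-difference and cosine-similarity check. A minimal sketch, using only the printed corner values as stand-ins (a real check would load the full `outputs.last_hidden_state`, squeezed to 2-D, and `outputs_trt`):

```python
import numpy as np

# Hypothetical stand-ins for the two tensors, taken from the printed corners.
hf_out = np.array([[1.0576, -4.4062, 1.1816],
                   [3.6680,  4.8086, 4.7578]], dtype=np.float32)
trt_out = np.array([[1.6475, -3.5586, 2.3145],
                    [4.3867,  3.2012, 5.6523]], dtype=np.float32)

# Elementwise worst-case error.
max_abs_diff = np.abs(hf_out - trt_out).max()

# Global directional agreement; ~1.0 would mean the outputs mostly agree.
a, b = hf_out.ravel(), trt_out.ravel()
cos_sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Errors on the order of the activations themselves (rather than small fp16 rounding noise) point to a numerical issue in the graph, not ordinary precision loss.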
Same issue. You can set `flash_attn` to false and use bf16 to compile; it works for me.
@seanxcwang
I followed the method you provided for testing. In the hf -> onnx step, I set use_flash_attn=False and loaded the model with torch.bfloat16.
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True).cuda().eval()
In the onnx -> trt stage, I tried both --fp16 and --best settings, but the result was the same: the difference between TRT and ONNX inference results remains significant.
MODEL_NAME="InternVL2-40B"
OUTPUT_MODEL_NAME="InternVL2_40B"
onnx_process_version="onnx_v1"
max_batch_size=24
onnx_dtype="float16"
trt_dtype="best"
/usr/src/tensorrt/bin/trtexec \
--onnx=/data/eas/visual_engine/a100/InternViT-6B-448px-V1-5/bfloat16/onnx/visual_encoder.onnx \
--saveEngine=/data/eas/visual_engine/a100/InternViT-6B-448px-V1-5/bfloat16/visual_encoder.trtexec.${trt_dtype}.maxBatchSize${max_batch_size}.engine \
--minShapes=input:1x3x448x448 \
--optShapes=input:8x3x448x448 \
--maxShapes=input:24x3x448x448 \
--${trt_dtype}
Did you compile following these steps?
I found that bfloat16 is not required, but `use_flash_attn` must be set to false when exporting the ONNX model, and `stronglyTyped` should be added when converting to the TRT engine. By the way, I use the Python API to compile the model.
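A strongly-typed build with the TensorRT Python API might look like the sketch below. This is an assumption about the commenter's setup, not their actual code; paths are placeholders, and the shapes mirror the trtexec commands earlier in the thread. In strongly-typed mode, layer precisions come from the dtypes recorded in the ONNX graph, so no `--fp16`/`--best`-style flags are set:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)

# STRONGLY_TYPED makes TensorRT honor the ONNX graph's own dtypes
# instead of freely autotuning layer precisions.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

parser = trt.OnnxParser(network, logger)
with open("visual_encoder.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  (1, 3, 448, 448),   # min
                  (8, 3, 448, 448),   # opt
                  (24, 3, 448, 448))  # max
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("visual_encoder.stronglyTyped.engine", "wb") as f:
    f.write(engine_bytes)
```

The trtexec equivalent would be to append `--stronglyTyped` to the conversion command instead of `--fp16`/`--best`.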
@seanxcwang
I found that the following section of code in the Hugging Face model is the cause. If I keep the TRT engine in float32, the inference results between TRT and HF remain consistent; if fp16 or best is configured, they do not. However, inference in float32 is quite slow, so I am currently looking for a solution.
class InternRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
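The float32 upcast in `forward` matters because squaring fp16 activations can overflow: fp16's maximum finite value is about 65504, so any activation above ~256 overflows when squared. A minimal numpy sketch of the same RMSNorm math (minus the learned weight; the input values are made up to trigger the overflow):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Same math as InternRMSNorm.forward, without the learned weight.
    variance = (x * x).mean(-1, keepdims=True)
    return x / np.sqrt(variance + eps)

x = np.array([300.0, -250.0, 275.0], dtype=np.float32)

out_fp32 = rms_norm(x)                     # finite, well-behaved
out_fp16 = rms_norm(x.astype(np.float16))  # 300**2 > 65504 -> inf -> x/inf == 0
```

When TensorRT runs this subgraph in fp16 (which `--fp16`/`--best` permit), the variance can saturate to inf and the normalized output collapses, which is consistent with keeping the build in float32 fixing the mismatch.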
Description
I attempted to compile a Hugging Face model (the Hugging Face model link is: https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5, which includes both the model architecture code and model files) using TensorRT (TRT) to improve inference speed. The steps I followed are hf -> onnx -> trt.
I performed inference on the same image using Hugging Face (hf), ONNX, and TRT engine. I found that the inference results from hf and ONNX were consistent, but the TRT engine's result was different from the former two.
I would like to know why the ONNX results are correct, but the inference results from the engine compiled with trtexec are wrong. Why is this happening?
The conversion code from hf to ONNX is:
The conversion code from ONNX to TRT engine is:
The inference code for hf is:
The inference code for ONNX is:
The inference code for TRT engine is:
The inference results are as follows:
Environment
TensorRT Version: v100500
NVIDIA GPU: A100
NVIDIA Driver Version: 535.54.03
CUDA Version: 12.2
CUDNN Version: 8920
Operating System: Docker image nvidia_cuda_12.4.0-devel-ubuntu22.04
Python Version (if applicable): 3.10.12
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.2.2+cu121
Baremetal or Container (if so, version): Docker
Relevant Files
Model link: https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5
internvn2_40b_image2_patch1.npy: internvn2_40b_image2_patch1.zip
ONNX file link: https://drive.google.com/file/d/1lnEmuQ4cNzf8YA7ddznqUnYsz-W5y5aJ/view?usp=sharing