NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen-VL-Chat vit embedding diff #1483

Open bnuzhanyu opened 4 months ago

bnuzhanyu commented 4 months ago

Problem

For the same input image, I get different outputs from the visual embedding, and this can make the result slightly worse than the original model.

Environment

tensorrt-llm 0.9.0, GPU: A10, model: https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary

Qwen-VL-Chat build command

MAX_BATCH_SIZE=8
HF_MODEL_DIR=$MODEL_ROOT_DIR/Qwen-VL-Chat
ONNX_FILE=$MODEL_ROOT_DIR/visual_encoder/visual_encoder.onnx
PLAN_FILE=$MODEL_ROOT_DIR/plan/visual_encoder/visual_encoder_fp16.plan
CHECKPOINT_DIR=$MODEL_ROOT_DIR/qwen_vl_trt/checkpoint
ENGINE_DIR=$MODEL_ROOT_DIR/qwen_vl_trt/engine
MAX_INPUT_LEN=2048
MAX_OUTPUT_LEN=1024
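# Qwen-VL's ViT produces 256 visual tokens per image (embedding shape [bs, 256, 4096]), hence MAX_BATCH_SIZE * 256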
MAX_PROMPT_EMBEDDING_TABLE_SIZE=$((MAX_BATCH_SIZE * 256))

export CUDA_VISIBLE_DEVICES=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

if [ ! -f $PLAN_FILE ]; then
    CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES python3 vit_onnx_trt.py --pretrained_model_path $HF_MODEL_DIR \
                --onnxFile $ONNX_FILE --planFile $PLAN_FILE --maxBS $MAX_BATCH_SIZE
fi

CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES python3 ../qwen/convert_checkpoint.py --model_dir=$HF_MODEL_DIR --output_dir=$CHECKPOINT_DIR

CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES trtllm-build --checkpoint_dir=$CHECKPOINT_DIR \
             --gemm_plugin=float16 --gpt_attention_plugin=float16 \
             --lookup_plugin=float16 --max_input_len=$MAX_INPUT_LEN --max_output_len=$MAX_OUTPUT_LEN \
             --max_batch_size=$MAX_BATCH_SIZE --max_prompt_embedding_table_size=$MAX_PROMPT_EMBEDDING_TABLE_SIZE \
             --remove_input_padding=enable \
             --output_dir=$ENGINE_DIR

TensorRT-LLM code that prints the input image and output embedding:

stream = torch.cuda.current_stream().cuda_stream
image_npy = self.image_preproc.encode([image_path])  # Preprocess
images = torch.cat([image_npy]).to(self.device)  #([bs, 3, 448, 448])
batch_size = images.size(0)
images = images.expand(batch_size, -1, -1, -1).contiguous()
print(images.shape)
print(images)
visual_inputs = {'input': images.float()}
visual_output_info = self.vit.infer_shapes(
    [TensorInfo('input', trt.DataType.FLOAT, images.shape)])
visual_outputs = {
    t.name: torch.empty(tuple(t.shape),
                        dtype=trt_dtype_to_torch(t.dtype),
                        device='cuda')
    for t in visual_output_info
}
ok = self.vit.run(visual_inputs, visual_outputs, stream)
assert ok, "Runtime execution failed for vit session"
image_embeds = visual_outputs['output']  # [bs, 256, 4096]
print(image_embeds)
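
For context, the self.vit session and TensorInfo used above come from the TensorRT-LLM runtime; below is a minimal sketch of how such a session can be created from the serialized plan (assuming the PLAN_FILE built by vit_onnx_trt.py in the build script, not the exact example code):

import tensorrt as trt
from tensorrt_llm.runtime import Session, TensorInfo

# Load the serialized visual-encoder engine and wrap it in a runtime Session;
# this plays the role of self.vit in the snippet above.
with open(PLAN_FILE, 'rb') as f:
    engine_buffer = f.read()
vit_session = Session.from_serialized_engine(engine_buffer)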

Different outputs using examples/qwenvl/pics/demo.jpeg

TensorRT-LLM output

# image
torch.Size([1, 3, 448, 448])
tensor([[[[ 0.8647,  0.9084,  0.9230,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.8792,  0.9376,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.9230,  0.9230,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          ...,
          [-0.7704, -0.7704, -0.7412,  ..., -0.2886, -0.3178, -0.3908],
          [-0.7558, -0.7558, -0.7558,  ..., -0.3470, -0.4054, -0.4492],
          [-0.7558, -0.7558, -0.7704,  ..., -0.4054, -0.4492, -0.4930]],

         [[ 1.2194,  1.2495,  1.2645,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2495,  1.2795,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2645,  1.2945,  ...,  1.8948,  1.8948,  1.8948],
          ...,
          [-0.5815, -0.5815, -0.5515,  ..., -0.3564, -0.3714, -0.4464],
          [-0.5665, -0.5665, -0.5515,  ..., -0.3864, -0.4614, -0.5065],
          [-0.5665, -0.5815, -0.6115,  ..., -0.4464, -0.4914, -0.5515]],

         [[ 1.2927,  1.3211,  1.3354,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3069,  1.3354,  1.3496,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3211,  1.3354,  1.3638,  ...,  1.9753,  1.9753,  1.9753],
          ...,
          [-0.3426, -0.3284, -0.3000,  ..., -0.2146, -0.2431, -0.3284],
          [-0.3142, -0.3142, -0.2857,  ..., -0.2573, -0.3284, -0.3711],
          [-0.3142, -0.3284, -0.3568,  ..., -0.3284, -0.3711, -0.4137]]]],
       device='cuda:0')

# embedding
torch.Size([1, 256, 4096])
tensor([[[ 2.2480, -1.3076, -0.6943,  ..., -0.2272, -3.1777, -1.1953],
         [ 3.5098,  0.3196,  1.2432,  ...,  1.5215, -1.5166, -1.1787],
         [ 1.5010, -2.9395, -1.0654,  ...,  3.9258, -0.6914, -0.2371],
         ...,
         [ 0.5205, -2.0645, -0.0531,  ...,  3.9160,  0.8760,  6.0273],
         [ 1.4053, -2.9629, -0.0939,  ...,  1.6025,  1.9092,  1.5703],
         [-1.9521, -2.8320, -2.5430,  ...,  5.6758,  0.3870,  3.1934]]],
       device='cuda:0', dtype=torch.float16)

Qwen-VL-Chat ModelScope

# image
torch.Size([1, 3, 448, 448])
tensor([[[[ 0.8647,  0.9084,  0.9230,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.8792,  0.9376,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.9230,  0.9230,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          ...,
          [-0.7704, -0.7704, -0.7412,  ..., -0.2886, -0.3178, -0.3908],
          [-0.7558, -0.7558, -0.7558,  ..., -0.3470, -0.4054, -0.4492],
          [-0.7558, -0.7558, -0.7704,  ..., -0.4054, -0.4492, -0.4930]],

         [[ 1.2194,  1.2495,  1.2645,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2495,  1.2795,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2645,  1.2945,  ...,  1.8948,  1.8948,  1.8948],
          ...,
          [-0.5815, -0.5815, -0.5515,  ..., -0.3564, -0.3714, -0.4464],
          [-0.5665, -0.5665, -0.5515,  ..., -0.3864, -0.4614, -0.5065],
          [-0.5665, -0.5815, -0.6115,  ..., -0.4464, -0.4914, -0.5515]],

         [[ 1.2927,  1.3211,  1.3354,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3069,  1.3354,  1.3496,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3211,  1.3354,  1.3638,  ...,  1.9753,  1.9753,  1.9753],
          ...,
          [-0.3426, -0.3284, -0.3000,  ..., -0.2146, -0.2431, -0.3284],
          [-0.3142, -0.3142, -0.2857,  ..., -0.2573, -0.3284, -0.3711],
          [-0.3142, -0.3284, -0.3568,  ..., -0.3284, -0.3711, -0.4137]]]])

# embedding
torch.Size([1, 256, 4096])
tensor([[[ 0.5669,  2.1602,  0.5522,  ...,  1.6719, -1.1719,  0.6343],
         [-0.2158, -1.9053, -1.3213,  ..., -0.2773, -1.0303, -1.5508],
         [ 0.4644,  0.0384, -2.0176,  ...,  2.0605,  0.4480, -1.5918],
         ...,
         [ 0.0100, -1.0996,  0.6797,  ...,  6.4961, -1.7705,  4.5273],
         [-1.6621, -2.1875,  0.3442,  ...,  2.1309,  1.2607,  3.2891],
         [-2.6719, -2.6094, -2.9102,  ...,  2.0137,  2.4043,  0.7583]]],

The input images are identical, but the output embeddings are not close.
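
One way to quantify the gap (a small sketch; trt_embeds and ref_embeds are hypothetical names for the two [1, 256, 4096] tensors printed above):

import torch

# trt_embeds: output of the TensorRT-LLM ViT engine
# ref_embeds: output of modeling_qwen.py's visual.encode
diff = trt_embeds.float() - ref_embeds.float()
print('max abs diff :', diff.abs().max().item())
print('mean abs diff:', diff.abs().mean().item())
# Per-token cosine similarity over the 4096-dim embeddings -> shape [1, 256]
cos = torch.nn.functional.cosine_similarity(trt_embeds.float(), ref_embeds.float(), dim=-1)
print('min per-token cosine sim:', cos.min().item())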

sunnyqgg commented 3 months ago

Hi @bnuzhanyu, you got the "TensorRT-LLM output" from the code under "TensorRT-LLM code that prints the input image and output embedding", right? And how did you get the "Qwen-VL-Chat ModelScope" results? If the ViT engine and the input are the same, the results are expected to be the same.

bnuzhanyu commented 3 months ago

Hi @bnuzhanyu, you got the "TensorRT-LLM output" from the code under "TensorRT-LLM code that prints the input image and output embedding", right? And how did you get the "Qwen-VL-Chat ModelScope" results? If the ViT engine and the input are the same, the results are expected to be the same.

  1. Yes, I modified the TensorRT-LLM source to get the "TensorRT-LLM output".
  2. I used the code at https://modelscope.cn/models/qwen/Qwen-VL-Chat/file/view/master?fileName=modeling_qwen.py&status=1
    # line 565
    images = self.visual.encode(images)

    to get the "Qwen-VL-Chat ModelScope" results.
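
For completeness, a minimal sketch of how that reference embedding can be reproduced from the ModelScope checkpoint; the transformer.visual attribute path, the encode([image_path]) signature, and the from_pretrained keyword arguments are assumptions based on the Qwen-VL-Chat model card and modeling_qwen.py:

import torch
from modelscope import AutoModelForCausalLM

# Load Qwen-VL-Chat with its custom modeling code (modeling_qwen.py / visual.py).
model = AutoModelForCausalLM.from_pretrained(
    'qwen/Qwen-VL-Chat', trust_remote_code=True, fp16=True, device_map='cuda').eval()

with torch.no_grad():
    # visual.encode takes a list of image paths and returns [bs, 256, 4096]
    ref_embeds = model.transformer.visual.encode(['examples/qwenvl/pics/demo.jpeg'])
print(ref_embeds.shape, ref_embeds.dtype)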

calico-niko commented 2 months ago

any update?

sunnyqgg commented 2 months ago

Hi @calico-niko @bnuzhanyu The ViT is offloaded to TRT, and its FP32 accuracy on TRT 9.3 is aligned with PyTorch. You can also upgrade TRT from 9.3 to 10.x; the FP16 accuracy on TRT 10.x is fine.

hezeli123 commented 1 month ago

Hi @calico-niko @bnuzhanyu The ViT is offloaded to TRT, and its FP32 accuracy on TRT 9.3 is aligned with PyTorch. You can also upgrade TRT from 9.3 to 10.x; the FP16 accuracy on TRT 10.x is fine.

I updated TRT to 10.0.1 but got the same diff; the results still vary a lot.

vit by trt:
tensor([[[ 1.1836, -0.6230,  0.5547,  ..., -0.2083, -1.2617, -3.3867],
         [ 0.4189, -0.3127, -0.5732,  ...,  0.8232,  1.0723,  0.8164],
         [-0.4614, -0.0329,  0.7266,  ...,  0.3604, -1.1826, -0.0151],
         ...,
         [-0.3298,  1.5420,  1.1074,  ...,  1.4434, -0.7012, -2.1191],
         [ 0.2686, -0.4331, -2.0234,  ..., -0.1218, -0.9346, -0.0122],
         [-1.3047,  0.8560, -2.2266,  ..., -0.5923,  1.6758,  0.3738]]],
       device='cuda:0', dtype=torch.float16)

vit by visual.encode(images):
tensor([[ 1.2646, -0.6035,  0.6665,  ..., -0.1432, -1.2197, -3.3691],
        [ 0.2174, -0.3809, -1.2285,  ...,  1.6143,  0.8530,  0.3674],
        [-0.4709, -0.0422,  0.7241,  ...,  0.3362, -1.1611, -0.0426],
        ...,
        [-0.3340,  1.5625,  1.0889,  ...,  1.5625, -0.6138, -2.0215],
        [ 0.2142, -0.3896, -2.0020,  ..., -0.0946, -0.9102,  0.0299],
        [-1.3281,  0.9053, -2.2617,  ..., -0.5645,  1.7812,  0.4031]],
       device='cuda:0', dtype=torch.float16)

byshiue commented 1 month ago

@sunnyqgg Do you have any other suggestions?

sunnyqgg commented 1 month ago

Hi @hezeli123, the diffs are smaller compared with TRT 9.x. Do the current ViT diffs have a big impact on the final results? If so, you can try running the ViT with FP32 precision.
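
If it helps, here is a rough sketch of building the visual-encoder engine in FP32 directly from the exported ONNX with the TensorRT Python API. It assumes vit_onnx_trt.py enables the FP16 builder flag when producing the *_fp16.plan, so simply not setting that flag yields an FP32 engine; the tensor name 'input' and shape [bs, 3, 448, 448] follow the snippets above, and the file names are placeholders:

import tensorrt as trt

onnx_file = 'visual_encoder/visual_encoder.onnx'            # from the build script above
plan_file = 'plan/visual_encoder/visual_encoder_fp32.plan'  # hypothetical FP32 output
max_bs = 8

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open(onnx_file, 'rb') as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
# Deliberately do NOT call config.set_flag(trt.BuilderFlag.FP16): keep pure FP32.
profile = builder.create_optimization_profile()
profile.set_shape('input', (1, 3, 448, 448), (1, 3, 448, 448), (max_bs, 3, 448, 448))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open(plan_file, 'wb') as f:
    f.write(engine)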

hezeli123 commented 1 month ago

The current ViT diffs have a big impact, which results in many bad cases. I am running the ViT with FP32 precision now.

sunnyqgg commented 1 month ago

OK. If you have a strong need to use FP16, I'll continue to look at this issue; if not, it will have a lower priority.