NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

batch inference is different with single #1879

Open 1096125073 opened 3 months ago

1096125073 commented 3 months ago

System Info

x86_64, 4x NVIDIA A10 GPUs, TensorRT-LLM 0.9.0

Who can help?

No response

Reproduction

A private llama2-type model, when run with batch inference on identical inputs (e.g., batch_size=4), yields four different answers (top_k=0, top_p=0, run.py).

Expected behavior

The four answers should be the same.

actual behavior

It yields four different answers.

additional notes

[image: screenshot showing four different outputs for the same repeated input]

1096125073 commented 3 months ago

I have disabled custom_all_reduce when building the engine.
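(For reference, a sketch of how that build invocation might look, assuming the 0.9.0-era trtllm-build flag is spelled --use_custom_all_reduce; the checkpoint and engine paths are placeholders:)

trtllm-build --checkpoint_dir=./ckpt --output_dir=./engine --gemm_plugin bfloat16 --max_batch_size=4 --use_custom_all_reduce disable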

QiJune commented 3 months ago

Hi @1096125073, different batch sizes may lead to different kernels being selected, so the results can differ. This is a known issue.
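(As an illustration of the kernel-selection point, a minimal PyTorch sketch, not specific to TensorRT-LLM: the same row is pushed through a half-precision GEMM alone and inside a batch of four. Depending on the GPU and cuBLAS heuristics the two results may not be bitwise identical, because different tilings change the floating-point reduction order. Requires a CUDA device.)

import torch

torch.manual_seed(0)
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

y1 = x @ w               # batch size 1
y4 = x.repeat(4, 1) @ w  # same row repeated 4 times, batch size 4

# Different batch shapes can pick different GEMM kernels, so the two results
# for the same row may differ in the last bits even though the math is identical.
print(torch.equal(y1[0], y4[0]))
print((y1[0].float() - y4[0].float()).abs().max())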

1096125073 commented 3 months ago

> Hi @1096125073, different batch sizes may lead to different kernels being selected, so the results can differ. This is a known issue.

Thank you for your answer! Sorry, I think I didn't express myself clearly. When I run inference with a batch of 4, the inputs in the batch are all the same, but the four outputs I get are different from each other.

QiJune commented 3 months ago

@1096125073 Yes, I get your point: you repeat the same input prompt 4 times to make a batch, but the outputs are different from batch size 1. Unfortunately, it's a known issue.

BTW, do you observe a similar phenomenon in PyTorch?
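(For the PyTorch side-by-side, a minimal Hugging Face Transformers sketch along these lines could be used; the model path is taken from the commands below, and greedy decoding stands in for top_k=0/top_p=0:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/llm-models/llama-models-v2/llama-v2-7b-hf/"
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16).cuda().eval()

# The same prompt repeated 4 times, decoded greedily; the four outputs
# should match if the batched computation is deterministic per row.
prompts = ["How are you"] * 4
inputs = tok(prompts, return_tensors="pt").to("cuda")
out = model.generate(**inputs, do_sample=False, max_new_tokens=10)
for seq in out:
    print(tok.decode(seq, skip_special_tokens=True))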

1096125073 commented 3 months ago

> @1096125073 Yes, I get your point: you repeat the same input prompt 4 times to make a batch, but the outputs are different from batch size 1. Unfortunately, it's a known issue.
>
> BTW, do you observe a similar phenomenon in PyTorch?

Sorry, I meant that these four outputs are different from each other, as in the picture above.

QiJune commented 3 months ago

Hi @1096125073 , I tried the llama2 model:

python convert_checkpoint.py --model_dir=/llm-models/llama-models-v2/llama-v2-7b-hf/ --output_dir=./ckpt --dtype bfloat16

trtllm-build --checkpoint_dir=./ckpt --output_dir=./engine --gemm_plugin bfloat16 --max_output_len=256 --max_batch_size=4

python ../run.py --engine_dir=./engine --max_output_len=10 --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/  --input_text 'How are you' 'How are you' 'How are you' 'How are you'

Here is the result:

Input [Text 0]: "<s> How are you"
Output [Text 0 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 1]: "<s> How are you"
Output [Text 1 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 2]: "<s> How are you"
Output [Text 2 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 3]: "<s> How are you"
Output [Text 3 Beam 0]: "doing? I hope you are doing well. I"

QiJune commented 3 months ago

@1096125073 Could you please try the main branch? It seems you are using version 0.9.0.

yuxianq commented 3 months ago

@1096125073 Do you use multiple GPUs? If so, you can set NCCL_ALGO=Tree to enforce a stable reduction order. NCCL usually selects the Ring algorithm, whose reduction order is not deterministic, which causes different results within the same batch. If you use a single GPU, it should be a different issue.
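(For example, assuming a 4-way tensor-parallel engine launched with OpenMPI, the variable could be passed along these lines; the exact mpirun flags depend on the MPI setup, and the engine and tokenizer paths are placeholders:)

mpirun -n 4 -x NCCL_ALGO=Tree python ../run.py --engine_dir=./engine --max_output_len=10 --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/ --input_text 'How are you' 'How are you' 'How are you' 'How are you'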

0xd8b commented 1 month ago

@QiJune I encountered the same issue with the T5 model (float16). In extensive sample testing, the inference results vary slightly with different batch sizes. Is this a normal phenomenon? I saw a similar issue reported here: https://github.com/dmlc/gluon-nlp/issues/1344.
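(Small batch-size-dependent differences are generally expected in reduced precision, since floating-point addition is not associative and a different batch size can change the accumulation order inside a kernel. A tiny NumPy illustration, unrelated to T5 itself:)

import numpy as np

# Summing the same float16 values in two different orders usually gives
# slightly different results; the same effect appears when kernels change
# their reduction order across batch sizes.
np.random.seed(0)
vals = np.random.randn(4096).astype(np.float16)

forward = np.float16(0)
for v in vals:
    forward += v

backward = np.float16(0)
for v in vals[::-1]:
    backward += v

print(forward, backward, forward == backward)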

1096125073 commented 1 month ago

> @1096125073 Do you use multiple GPUs? If so, you can set NCCL_ALGO=Tree to enforce a stable reduction order. NCCL usually selects the Ring algorithm, whose reduction order is not deterministic, which causes different results within the same batch. If you use a single GPU, it should be a different issue.

Yes, this is the answer I wanted. Thanks!

chiendb97 commented 1 month ago

@QiJune I'm experiencing the issue even when using a single GPU. If the discrepancies in results are due to varying kernel choices, is there a way to sacrifice some performance in exchange for more stability?