[Open] @1096125073 opened this issue 3 months ago
I have disabled custom_all_reduce when building the engine.
Hi @1096125073, different batch sizes may lead to different kernels being selected, so the results can differ. This is a known issue.
Thank you for your answer! Sorry, I may not have expressed myself clearly. When I run inference with batch size 4 and the same input for every item in the batch, the four outputs I get are different from each other.
@1096125073 Yes, I get your point: repeat the same input prompt 4 times, make it a batch, and the outputs differ from batch size 1. Unfortunately, it's a known issue.
BTW, do you observe a similar phenomenon in PyTorch?
Sorry, I meant that these four outputs are different from each other, as in the picture above.
Hi @1096125073 , I tried the llama2 model:
python convert_checkpoint.py --model_dir=/llm-models/llama-models-v2/llama-v2-7b-hf/ --output_dir=./ckpt --dtype bfloat16
trtllm-build --checkpoint_dir=./ckpt --output_dir=./engine --gemm_plugin bfloat16 --max_output_len=256 --max_batch_size=4
python ../run.py --engine_dir=./engine --max_output_len=10 --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/ --input_text 'How are you' 'How are you' 'How are you' 'How are you'
Here is the result:
Input [Text 0]: "<s> How are you"
Output [Text 0 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 1]: "<s> How are you"
Output [Text 1 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 2]: "<s> How are you"
Output [Text 2 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 3]: "<s> How are you"
Output [Text 3 Beam 0]: "doing? I hope you are doing well. I"
@1096125073 Could you please try the main branch? It seems you are using the 0.9.0 version.
@1096125073 Do you use multiple GPUs? If so, you can set NCCL_ALGO=Tree to ensure a stable reduction order. NCCL usually selects the Ring algorithm, whose reduction order is unstable, which causes different results within the same batch. If you use a single GPU, then it should be some other issue.
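To illustrate why reduction order matters at all: floating-point addition is not associative, so summing the same values in a different order can produce a different result. A minimal toy sketch in plain Python (not an actual NCCL reduction, just the underlying numerical effect):

```python
# Floating-point addition is not associative: different reduction orders
# can yield different sums. This is the root cause of run-to-run output
# differences when the reduce order is unstable.
vals = [1e16, 1.0, -1e16, 1.0]

# Left-to-right: 1e16 + 1.0 rounds back to 1e16, so one 1.0 is lost.
left_to_right = (vals[0] + vals[1]) + vals[2] + vals[3]   # 1.0

# Reordered: the large terms cancel first, so both 1.0s survive.
reordered = (vals[0] + vals[2]) + vals[1] + vals[3]       # 2.0

print(left_to_right, reordered)
```

The same effect occurs on the GPU when an all-reduce accumulates partial sums in a nondeterministic order, which is why forcing a fixed algorithm (e.g. NCCL_ALGO=Tree) can stabilize results.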
@QiJune I encountered the same issue with the T5 model (float16). The inference results vary slightly with different batch sizes during extensive sample testing. Is this a normal phenomenon? I saw a similar report here: https://github.com/dmlc/gluon-nlp/issues/1344.
Yes, this is the answer I wanted. Thanks!
@QiJune I'm experiencing these issues even when using a single GPU. If the discrepancies are due to varying kernel choices, is there a way to sacrifice some performance in exchange for more stability?
System Info
x86-64, 4x A10, TensorRT-LLM 0.9.0
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
A private Llama-2-type model, when given the same input repeated across a batch (e.g. batch_size=4), yields four different answers (top_k=0, top_p=0, via run.py).
Expected behavior
The four answers should be identical.
actual behavior
It yields four different answers.
additional notes