AmazDeng opened this issue 2 months ago
Additionally, I tried passing do_sample=False; inference ran without errors, but the flag had no effect, and the results within a batch were still not identical.
I ran the batch inference loop ten times. Although the results within each batch varied, the results were exactly the same from one batch run to the next. So where does the randomness in the TensorRT-LLM generate method come from? I want the inference results within the same batch, for the same prompt, to be completely identical.
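The behavior described above (variation within a batch, but identical results across repeated runs) is consistent with sampling driven by a fixed random seed: every run replays the same random stream, so run-to-run outputs match, while each sequence in the batch consumes different draws from that stream. A toy reproduction of this pattern, using plain NumPy rather than TensorRT-LLM (the function and names here are illustrative only):

```python
import numpy as np

def batch_sample(seed: int, batch_size: int, steps: int) -> list[list[int]]:
    """Sample token ids for a batch of identical 'prompts' from one seeded RNG."""
    rng = np.random.default_rng(seed)
    probs = np.array([0.5, 0.3, 0.2])  # same next-token distribution for every sequence
    return [[int(rng.choice(3, p=probs)) for _ in range(steps)]
            for _ in range(batch_size)]

# Same seed -> the whole batch is reproduced exactly on a second run,
# even though the sequences inside one batch differ from each other.
run1 = batch_sample(seed=1234, batch_size=4, steps=16)
run2 = batch_sample(seed=1234, batch_size=4, steps=16)
```

Here `run1 == run2` holds, while the four sequences inside `run1` generally differ, matching the observation above.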
In TensorRT-LLM, if you don't set beam_width (the default value is 1), it uses sampling. Under sampling, you can use top_k and top_p to control the behavior; if you set top_k = 1, it performs greedy search.
If you set beam_width > 1, TensorRT-LLM uses beam search and ignores the top_k and top_p values.
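The link between top_k and greedy search can be shown without TensorRT-LLM itself: with top_k = 1, only the highest-probability token survives the filter, so "sampling" degenerates to argmax and becomes deterministic. A minimal sketch (the function below is illustrative, not TRT-LLM API):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, top_k: int, rng: np.random.Generator) -> int:
    """Sample a token id from the top_k highest-logit candidates."""
    top_ids = np.argsort(logits)[::-1][:top_k]      # ids of the top_k logits
    top_logits = logits[top_ids]
    probs = np.exp(top_logits - top_logits.max())   # softmax over the survivors
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, 1.5, -0.3])

# top_k = 1: every draw returns the argmax token (id 1) -> deterministic
greedy = {sample_next_token(logits, top_k=1, rng=rng) for _ in range(100)}
# top_k = 3: several tokens can be drawn -> outputs vary draw to draw
sampled = {sample_next_token(logits, top_k=3, rng=rng) for _ in range(100)}
```

With top_k = 1 the candidate set has one element, so the draw is forced; with top_k = 3 the set of observed tokens grows beyond one.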
So, there are two ways to set the parameters, then?
Not fully correct. To achieve do_sample=False, you should set top_k=1 and beam_width=1 at the same time.
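Applied to the generate call in examples/multimodal/run.py, the combined setting would look roughly like the sketch below. This is an untested configuration fragment: the keyword names (top_k, num_beams) follow TensorRT-LLM's sampling options, but you should verify them against the version you are running.

```python
# Sketch only: assumes `runner` and `batch_input_ids` are already set up
# as in examples/multimodal/run.py. Keyword names may differ across
# TensorRT-LLM versions -- verify before use.
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=64,
    num_beams=1,  # beam_width = 1 -> sampling path, not beam search
    top_k=1,      # restrict sampling to the single best token -> greedy
)
```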
I tested batch inference for the llava and llava-next-video models with TensorRT-LLM, based on the examples/multimodal/run.py file. Both use the same parameters for their generate method; specifically, I used the defaults in the generate method without any modifications and set the batch size to 8.

My question: for the llava model, all 8 outputs in a batch are exactly the same, but for the llava-next-video model the 8 results differ. I want the batch inference results of the llava-next-video model to be exactly identical as well. How should I set the parameters to achieve the effect of the do_sample=False setting in the Hugging Face transformers model.generate method?