NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

how to set do_sample=False? #1899

Open · AmazDeng opened this issue 2 months ago

AmazDeng commented 2 months ago

I tested batch inference with the llava and llava-next-video models in tensorrt-llm, based on the examples/multimodal/run.py file. Both models use the same parameters for the generate method, shown below; I kept the defaults without any modifications and set the batch size to 8. My question: with the llava model, all 8 outputs in a batch are exactly the same, but with the llava-next-video model the 8 results differ. I want the batch inference results of the llava-next-video model to also be identical. How should I set the parameters to get the effect of do_sample=False in the Hugging Face transformers model.generate method?


    # generate() call from examples/multimodal/run.py; all sampling-related
    # arguments (top_k, top_p, temperature, num_beams, ...) are left at the
    # script's defaults.
    output_ids = self.model.generate(
        input_ids,
        sampling_config=None,
        prompt_table=ptuning_args[0],
        max_new_tokens=max_new_tokens,
        end_id=end_id,
        pad_id=self.tokenizer.pad_token_id
        if self.tokenizer.pad_token_id is not None else
        self.tokenizer.all_special_ids[0],
        top_k=self.args.top_k,
        top_p=self.args.top_p,
        temperature=self.args.temperature,
        repetition_penalty=self.args.repetition_penalty,
        num_beams=self.args.num_beams,
        output_sequence_lengths=False,
        return_dict=False)
AmazDeng commented 2 months ago

I also tried passing do_sample=False; inference did not throw an error, but the parameter had no effect, as the results within a batch were still not identical.

AmazDeng commented 2 months ago

I ran the batch inference loop ten times. Although the results within each batch varied, the results across batches were exactly the same. So where does the randomness in the tensorrt-llm generate method come from? What I want is for the inference results within the same batch, for the same prompt, to be completely identical.

byshiue commented 2 months ago

In TensorRT-LLM, if you don't set beam_width (the default value is 1), it uses sampling. Under sampling, you can use top_k and top_p to control the sampling. If you set top_k = 1, it will use greedy search.

If you set beam_width > 1, TRT-LLM will use beam search and ignore the top_k and top_p values.
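
For illustration, a minimal sketch of how these two modes map onto keyword arguments of the generate(...) call from the first post (argument names are taken from that snippet; num_beams plays the role of beam_width, and exact handling may vary across TensorRT-LLM versions):

    # Sketch only: keyword arguments for the two decoding modes described above.
    # num_beams here plays the role of beam_width; names follow the run.py snippet.

    # Sampling mode (beam_width == 1): top_k / top_p control the randomness.
    # top_k=1 keeps only the single most likely token, i.e. greedy search.
    greedy_sampling_kwargs = dict(
        num_beams=1,
        top_k=1,
    )

    # Beam-search mode (beam_width > 1): top_k and top_p are ignored.
    beam_search_kwargs = dict(
        num_beams=4,
    )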

AmazDeng commented 2 months ago

> In TensorRT-LLM, if you don't set beam_width (the default value is 1), it uses sampling. Under sampling, you can use top_k and top_p to control the sampling. If you set top_k = 1, it will use greedy search.
>
> If you set beam_width > 1, TRT-LLM will use beam search and ignore the top_k and top_p values.

So, there are two ways to set parameters:

  1. top_k=1;
  2. beam_width=1.

Both methods can achieve the purpose of do_sample=False. Is this correct?
byshiue commented 1 month ago

Not fully correct. To achieve do_sample=False, you should set top_k=1 and beam_width=1 at the same time.
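
Applied to the run.py-style call from the first post, that advice looks roughly like the sketch below (same surrounding context as the original snippet; num_beams stands in for beam_width, and only the sampling-related arguments change):

    # Sketch: deterministic (greedy) decoding, per the advice above.
    # Same context as the snippet from examples/multimodal/run.py in the first post.
    output_ids = self.model.generate(
        input_ids,
        sampling_config=None,
        prompt_table=ptuning_args[0],
        max_new_tokens=max_new_tokens,
        end_id=end_id,
        pad_id=self.tokenizer.pad_token_id
        if self.tokenizer.pad_token_id is not None else
        self.tokenizer.all_special_ids[0],
        top_k=1,      # keep only the most likely token -> greedy search
        num_beams=1,  # beam_width = 1, so the sampling path (not beam search) is used
        # temperature and top_p are left out; with top_k=1 they should no longer
        # affect which token is chosen.
        repetition_penalty=self.args.repetition_penalty,
        output_sequence_lengths=False,
        return_dict=False)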