NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Diversity Search not resulting in diverse outputs #1707

Open Bhuvanesh09 opened 3 weeks ago

Bhuvanesh09 commented 3 weeks ago

We found in an earlier issue that TRT-LLM supports Grouped Diverse Beam Search: https://github.com/NVIDIA/TensorRT-LLM/issues/79#issuecomment-1825401751

However, we are unable to change the groups or their sizes. We searched the code for such options but could not find any. Setting beam_search_diversity_rate to 2.0 does not lead to any perceivable increase in the variance of the output.

Example:

SamplingConfig(end_id=2, pad_id=2, max_new_tokens=20, num_beams=3, max_kv_cache_length=None, output_sequence_lengths=False, return_dict=False, temperature=1.0, top_k=100, top_p=0.9, length_penalty=1.0, repetition_penalty=1.0, min_length=1, presence_penalty=0.0, use_beam_hyps=True, beam_search_diversity_rate=2.0, random_seed=8872, output_cum_log_probs=False, output_log_probs=False)

Leads to the output:

Greets to the human user, how are your day' s proceedings going on, is it smooth sailings?
Greets to the human user, how are your day' s proceedings going on, is it smooth sailings? or, have we hit a snags in our path.
Greets to the human user, how are your day' s proceedings going on, is it smooth sailings? or, have we hit a snags in our path. Regarding us the bOTS we at our end of service and duty.

and with the sampling config:

SamplingConfig(end_id=2, pad_id=2, max_new_tokens=20, num_beams=3, max_kv_cache_length=None, output_sequence_lengths=False, return_dict=False, temperature=1.0, top_k=100, top_p=0.9, length_penalty=1.0, repetition_penalty=1.0, min_length=1, presence_penalty=0.0, use_beam_hyps=True, beam_search_diversity_rate=0.5, random_seed=9613, output_cum_log_probs=False, output_log_probs=False)

leads to the output:

Greets to the human user, how are your day' s proceedings going on, is it smooth sailings?
Greets to the human user, how are your day' s proceedings going on, is it smooth sailings? or, have we hit a snags in our path.
Greets to the human user, how are your day' s proceedings going on, is it smooth sailings? or, have we hit a snags in our path. Regarding us the bOTS we at our end of service and duty.
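
For reference, a minimal sketch of how a config like the ones above might be constructed with the python runtime's SamplingConfig (that every field shown in the repr is accepted as a constructor keyword argument is an assumption on our side):

from tensorrt_llm.runtime import SamplingConfig

# Only the fields relevant to beam diversity are shown; values mirror
# the first dump above.
sampling_config = SamplingConfig(
    end_id=2,
    pad_id=2,
    num_beams=3,                     # beam width
    temperature=1.0,
    top_k=100,
    top_p=0.9,
    beam_search_diversity_rate=2.0,  # the knob that appears to have no effect
)
# The config is then passed to the generation session's decode call.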

Kindly guide us on how we can achieve greater diversity in our output. Thanks!

byshiue commented 3 weeks ago

Could you share how you set beam_search_diversity_rate? I tried it on a LLaMA 7B model and it works well on the latest main branch.

python examples/llama/convert_checkpoint.py --model_dir /llama-models/llama-7b-hf/ \
                              --output_dir /tmp/tllm_checkpoint_1gpu_fp16 \
                              --dtype float16

python3 -m tensorrt_llm.commands.build --checkpoint_dir /tmp/tllm_checkpoint_1gpu_fp16 \
            --output_dir /tmp/tmp/llama/7B/trt_engines/fp16/1-gpu \
            --gemm_plugin auto \
            --max_beam_width 4

python examples/run.py --engine_dir /tmp/tmp/llama/7B/trt_engines/fp16/1-gpu \
                       --max_output_len 10 \
                       --use_py_session \
                       --tokenizer_dir /llama-models/llama-7b-hf/ \
                       --num_beams 4

default (beam_search_diversity_rate = 0)

Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "pastry chef before moving to London in 1"
Output [Text 0 Beam 1]: "pastry chef before moving to London in 2"
Output [Text 0 Beam 2]: "pastry chef in Paris before moving to London in"
Output [Text 0 Beam 3]: "pastry chef before moving to the UK in "

beam_search_diversity_rate = 2.0

Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "cook before working in restaurants in London and Paris"
Output [Text 0 Beam 1]: "cook before working in restaurants in London, Paris"
Output [Text 0 Beam 2]: "cook before working in restaurants in London, New"
Output [Text 0 Beam 3]: "cook before working in restaurants in London, including"

Bhuvanesh09 commented 3 weeks ago

@byshiue: Thanks for the prompt reply. In the example you provided, there is still very little diversity among the different beams of a single prediction. Grouped beam search ensures that the same beams are not picked across groups, which enforces significant diversity: https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.group_beam_search

There appears to be no option to set the group width for beam search in TRT-LLM.
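
For comparison, grouped beam search in Hugging Face transformers is controlled by num_beam_groups together with diversity_penalty. A minimal sketch (checkpoint name and prompt are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

inputs = tok("Born in north-east France, Soyer trained as a", return_tensors="pt")

# 4 beams split into 2 groups of 2; the diversity penalty discourages a
# group from emitting tokens that earlier groups already chose this step.
outputs = model.generate(
    **inputs,
    num_beams=4,
    num_beam_groups=2,
    diversity_penalty=1.0,
    num_return_sequences=4,
    max_new_tokens=10,
)
for seq in outputs:
    print(tok.decode(seq, skip_special_tokens=True))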

byshiue commented 2 weeks ago

From the paper, I recall that diverse beam search only encourages choosing different beams through a penalty. The diversity is controlled by the penalty strength, so it cannot guarantee that different beams are chosen.
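
Concretely, the penalty from the Diverse Beam Search paper (Vijayakumar et al., 2016) subtracts a fixed amount from a token's score for each earlier group that already picked it at the current step. A minimal sketch of that scoring rule (function and variable names are ours for illustration, not TRT-LLM internals):

import numpy as np

def penalized_scores(log_probs, prev_group_tokens, diversity_rate):
    # log_probs: (vocab_size,) log-probabilities for one beam of the
    # current group at the current decoding step.
    # prev_group_tokens: token ids already selected at this step by
    # earlier groups.
    scores = log_probs.copy()
    for tok in prev_group_tokens:
        # Each earlier pick of a token lowers its score. This only
        # encourages later groups to differ; a token whose original
        # score is high enough can still win despite the penalty.
        scores[tok] -= diversity_rate
    return scores

This is why a modest beam_search_diversity_rate can leave the beams nearly unchanged when one continuation dominates the distribution.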