NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Bug] Lookahead decoding is nondeterministic and wrong after the first call to runner.generate #2263

Open tloen opened 2 months ago

tloen commented 2 months ago

Who can help?

@kaiyux @byshiue

Reproduction

In examples/run.py, wrap the generation call and the output handling in a for loop so that runner.generate is called three times with identical inputs:

for _ in range(3): # THIS IS THE ONLY CHANGE
        with torch.no_grad():
            outputs = runner.generate(
                batch_input_ids=decoder_input_ids
                if is_enc_dec else batch_input_ids,
                encoder_input_ids=encoder_input_ids if is_enc_dec else None,
                encoder_input_features=encoder_input_features
                if is_enc_dec else None,
                encoder_output_lengths=encoder_output_lengths
                if is_enc_dec else None,
                max_new_tokens=args.max_output_len,
                max_attention_window_size=args.max_attention_window_size,
                sink_token_length=args.sink_token_length,
                end_id=end_id,
                pad_id=pad_id,
                temperature=args.temperature,
                top_k=args.top_k,
                top_p=args.top_p,
                num_beams=args.num_beams,
                length_penalty=args.length_penalty,
                early_stopping=args.early_stopping,
                repetition_penalty=args.repetition_penalty,
                presence_penalty=args.presence_penalty,
                frequency_penalty=args.frequency_penalty,
                stop_words_list=stop_words_list,
                bad_words_list=bad_words_list,
                output_cum_log_probs=(args.output_cum_log_probs_npy != None),
                output_log_probs=(args.output_log_probs_npy != None),
                random_seed=args.random_seed,
                lora_uids=args.lora_task_uids,
                prompt_table=args.prompt_table_path,
                prompt_tasks=args.prompt_tasks,
                streaming=args.streaming,
                output_sequence_lengths=True,
                no_repeat_ngram_size=args.no_repeat_ngram_size,
                return_dict=True,
                medusa_choices=args.medusa_choices,
                return_all_generated_tokens=args.return_all_generated_tokens,
                input_token_extra_ids=input_token_extra_ids)
            torch.cuda.synchronize()

        if args.streaming:
            for curr_outputs in throttle_generator(outputs,
                                                   args.streaming_interval):
                if runtime_rank == 0:
                    output_ids = curr_outputs['output_ids']
                    sequence_lengths = curr_outputs['sequence_lengths']
                    cum_log_probs = None
                    log_probs = None
                    if args.output_cum_log_probs_npy != None:
                        cum_log_probs = outputs['cum_log_probs']
                    if args.output_log_probs_npy != None:
                        log_probs = outputs['log_probs']
                    print_output(
                        tokenizer,
                        output_ids,
                        input_lengths,
                        sequence_lengths,
                        output_csv=args.output_csv,
                        output_npy=args.output_npy,
                        cum_log_probs=cum_log_probs,
                        log_probs=log_probs,
                        output_cum_log_probs_npy=args.output_cum_log_probs_npy,
                        output_log_probs_npy=args.output_log_probs_npy)
        else:
            if runtime_rank == 0:
                output_ids = outputs['output_ids']
                sequence_lengths = outputs['sequence_lengths']
                context_logits = None
                generation_logits = None
                cum_log_probs = None
                log_probs = None
                if runner.gather_context_logits:
                    context_logits = outputs['context_logits']
                if runner.gather_generation_logits:
                    generation_logits = outputs['generation_logits']
                if args.output_cum_log_probs_npy != None:
                    cum_log_probs = outputs['cum_log_probs']
                if args.output_log_probs_npy != None:
                    log_probs = outputs['log_probs']
                print_output(tokenizer,
                             output_ids,
                             input_lengths,
                             sequence_lengths,
                             output_csv=args.output_csv,
                             output_npy=args.output_npy,
                             context_logits=context_logits,
                             generation_logits=generation_logits,
                             output_logits_npy=args.output_logits_npy,
                             cum_log_probs=cum_log_probs,
                             log_probs=log_probs,
                             output_cum_log_probs_npy=args.output_cum_log_probs_npy,
                             output_log_probs_npy=args.output_log_probs_npy)

Then run the script with lookahead decoding enabled:

python run.py \
    --max_output_len=50 \
    --lookahead_config='[2,2,1]' \
    --tokenizer_dir=[DIR] \
    --engine_dir=[DIR]

Expected behavior

Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"

Actual behavior

The first iteration produces the expected output. Later iterations produce incorrect output that also varies from one call to the next.

Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 11 1111111111111111111111111111111111"
Input [Text 0]: "<|begin▁of▁sentence|>You are a diligent AI assistant that follows commands exactly.
### Instruction:
Please say "1" a thousand times.
### Response:
1, 1, 1, 1, 1,"
Output [Text 0 Beam 0]: " 1, 1, 1, 1, 1111111111111111111111111111111111111"

Additional notes

The model uses the Llama architecture, and max_draft_len is 107. The error does not occur when the number of verification branches is zero or when the window size is 1.
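
For anyone who wants to check the repro automatically rather than by reading the printed text, here is a minimal sketch that calls a generation function several times and reports which calls diverge from the first. The helper name find_divergent_calls and the StatefulStub class are hypothetical and not part of TensorRT-LLM. The stub only imitates the symptom reported above (a correct first call, wrong later calls) so that the snippet runs without a GPU or an engine.

```python
from typing import Callable, List, Sequence


def find_divergent_calls(generate_fn: Callable[[], Sequence[int]],
                         num_calls: int = 3) -> List[int]:
    """Call generate_fn num_calls times with no arguments and return the
    indices (starting at 1) of calls whose token ids differ from call 0."""
    if num_calls < 1:
        raise ValueError("num_calls must be at least 1")
    reference = list(generate_fn())
    divergent = []
    for call_index in range(1, num_calls):
        if list(generate_fn()) != reference:
            divergent.append(call_index)
    return divergent


class StatefulStub:
    """Hypothetical stand-in for a runner: the first call returns the
    expected token ids, every later call returns different ids."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> List[int]:
        self.calls += 1
        expected = [1, 11, 1, 11, 1]
        corrupted = [1, 11, 1, 1, 1]
        return expected if self.calls == 1 else corrupted


if __name__ == "__main__":
    # Stateful stub: calls 1 and 2 differ from call 0.
    print(find_divergent_calls(StatefulStub()))      # [1, 2]
    # Stateless function: no call differs.
    print(find_divergent_calls(lambda: [1, 11, 1]))  # []
```

With the real runner, generate_fn would wrap the runner.generate(...) call from the loop above and return the token ids taken from outputs['output_ids'] as a plain list. The exact indexing and conversion are assumptions to adapt to the actual tensor layout. An empty result over three calls would correspond to the expected behavior, and a result of [1, 2] would correspond to the behavior reported here.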

davidmlw commented 1 month ago

Thank you very much! The bug was fixed recently, and the fix will be released soon.

kaiyux commented 1 month ago

Hi @tloen, the issue should be addressed after this PR. Can you please try it and see if that solves the problem? Feel free to let us know if you have any more questions, thanks!