NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Issue with logits_processor logic in token generation process #1175

Open dfcastap opened 6 months ago

dfcastap commented 6 months ago

System Info

CPU: X86_64
GPU: L4
RAM: 48GB
OS: Debian 11
Python: 3.10
TensorRT-LLM version: 0.9.0.dev2024022700
TensorRT: 9.2.0.post12.dev5
CUDA Version: 12.2
NVIDIA Driver Version: 535.86.10

Who can help?

@byshiue

Reproduction

Running the multimodal example (run.py) for LLaVA, but editing the generate call to pass a logits_processor, produces weird behaviour where the same token is generated over and over again. Specifically, I changed (source in run.py):

output_ids = self.model.generate(
                input_ids,
                sampling_config=None,
                prompt_table_path='prompt_table.npy',
                max_new_tokens=max_new_tokens,
                end_id=end_id,
                pad_id=self.tokenizer.pad_token_id,
                top_k=self.args.top_k,
                num_beams=self.args.num_beams,
                output_sequence_lengths=False,
                return_dict=False)

to be

output_ids = self.model.generate(
                input_ids,
                sampling_config=None,
                prompt_table_path='prompt_table.npy',
                max_new_tokens=max_new_tokens,
                end_id=end_id,
                pad_id=self.tokenizer.pad_token_id,
                top_k=self.args.top_k,
                num_beams=self.args.num_beams,
                output_sequence_lengths=False,
                return_dict=False,
                logits_processor=logits_processor)

where logits_processor is a function that modifies the logits tensor. It seems like any operation that touches the logits breaks every subsequent generation step. I went so far as to define logits_processor as a no-op:

def logits_processor(step, final_output_ids_, logits):
    # Multiplying by 1.0 leaves every value unchanged; this should be a no-op.
    return 1.0 * logits

The generated output tokens were still all the same.

Interestingly, while debugging, I edited tensorrt_llm/runtime/generation.py (source) to remove the self.buffer['logits'] = logits assignment, and the problem went away.

Expected behavior

I want to use the logits processor feature to enable things like constrained JSON decoding.
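
For illustration, a minimal sketch of the kind of processor I ultimately want to plug in (the (step, final_output_ids_, logits) signature matches the callback above; allowed_token_ids is a hypothetical placeholder that would normally come from a JSON grammar or FSM state):

import torch

def masking_logits_processor(step, final_output_ids_, logits):
    # Hypothetical allow-list; in a real JSON-constrained decoder this would be
    # derived from the grammar state at the current step.
    allowed_token_ids = torch.tensor([11, 198, 705], device=logits.device)
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_token_ids] = 0.0
    # Disallowed tokens get -inf, allowed tokens keep their original scores.
    return logits + mask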

Actual behavior

Adding a logits processor that edits the logits in any way during a generation step causes the model to output the same token over and over again.

Additional notes

What are the consequences of removing the self.buffer['logits'] = logits line in the generation script? Is this a bug?

Thank you for the help!

OValery16 commented 6 months ago

I also hit the same problem with the Whisper model. Unfortunately, removing self.buffer['logits'] = logits doesn't seem to solve it; the run crashes with an illegal memory access. The interesting thing is that the problem doesn't happen with num_beams=1.

dfcastap commented 6 months ago

Interesting... so there's a mix of parameters and/or models that affects this behaviour. In my case, for LLaVA, num_beams=1 (the default in examples/multimodal/run.py for this model), but I did have to remove self.buffer['logits'] = logits to get the output I expected.

chinthysl commented 6 months ago

num_beams=1 plus removing the self.buffer['logits'] = logits assignment doesn't help in my case either. I'm trying to integrate the JSONLogitsProcessor from Outlines (https://github.com/outlines-dev/outlines/blob/main/outlines/integrations/vllm.py) into trt_llm. The logits are masked by my logits processor after the first iteration. Here is how it looks:

next_token_logits: tensor([[[ -inf, 184.9984, -inf, ..., -inf, -inf, -inf]]], device='cuda:0')

Without the logits processor it looks like this:

next_token_logits: tensor([[[207.3016, 238.9770, 137.5957, ..., 207.0755, 216.6228, 211.6978]]], device='cuda:0')

Then, as far as I understand it, generation goes through:

should_stop = self.dynamic_decoder.forward(
    next_token_logits, decode_step, max_context_length,
    self.max_attention_window_size, self.sink_token_length, ite, batch_size,
    self.end_ids, self.embedding_bias_opt, context_lengths,
    sequence_limit_lengths, stop_words_list_ptrs, stop_words_lens,
    max_stop_words_len, bad_words_list_ptrs, bad_words_lens,
    max_bad_words_len, no_repeat_ngram_size, this_src_cache_indirection,
    self.output_ids, self.new_tokens, self.finished, self.finished,
    self.sequence_length_buffer, self.cum_log_probs, self.log_probs,
    self.log_probs_tiled, self.parent_ids, this_tgt_cache_indirection,
    self.beam_hyps_output_ids_tgt, self.beam_hyps_sequence_lengths_tgt,
    self.beam_hyps_cum_log_probs, self.beam_hyps_normed_scores,
    self.beam_hyps_log_probs, self.beam_hyps_min_normed_scores,
    self.beam_hyps_num_beams, self.beam_hyps_is_done, scfg.use_beam_hyps)

should_stop always becomes tensor([True]). I'm not sure whether something goes wrong in the subsequent kernel invocations when -inf values are present in the logits.
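
One way to sanity-check the masked logits right before the dynamic_decoder.forward call (a rough sketch, not part of TRT-LLM; end_id is the end-of-sequence token id already available in the session):

import torch

def inspect_masked_logits(next_token_logits, end_id):
    # next_token_logits is the [batch, beam, vocab] tensor shown above.
    finite = torch.isfinite(next_token_logits)
    # Every [batch, beam] row should still have at least one allowed (finite) token.
    print("allowed tokens per row:", finite.sum(dim=-1))
    # If the only surviving token is end_id, the decoder emits EOS immediately
    # and should_stop flips to True on the next step.
    argmax = next_token_logits.argmax(dim=-1)
    print("argmax per row:", argmax)
    print("argmax == end_id:", argmax == end_id)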