dfcastap opened this issue 6 months ago
I'm hitting the same problem with the Whisper model. Unfortunately, removing the `self.buffer['logits'] = logits` line doesn't solve it for me: generation crashes with an illegal memory access. The interesting thing is that the problem doesn't happen with `num_beams=1`.
Interesting... so there's a mix of parameters and/or models that affects this behaviour. In my case, for LLaVA, `num_beams=1` (as per the default params in `examples/multimodal/run.py` for the model), but I did have to remove the `self.buffer['logits'] = logits` line to get the output I expected.
Using `num_beams=1` and removing the `self.buffer['logits'] = logits` assignment doesn't help in my case either. I'm trying to integrate the `JSONLogitsProcessor` from Outlines (https://github.com/outlines-dev/outlines/blob/main/outlines/integrations/vllm.py) into TensorRT-LLM. After the first iteration, the logits coming out of my logits processor are masked. Here is what they look like:

```
next_token_logits: tensor([[[ -inf, 184.9984, -inf, ..., -inf, -inf, -inf]]], device='cuda:0')
```

Without the logits processor they look like this:

```
next_token_logits: tensor([[[207.3016, 238.9770, 137.5957, ..., 207.0755, 216.6228, 211.6978]]], device='cuda:0')
```
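For context, the masking such a processor performs is conceptually the following (a minimal sketch, not the actual Outlines code; in the real processor `allowed_token_ids` comes from the JSON-schema FSM):

```python
import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # Every token outside the allowed set gets -inf, so only
    # schema-conforming tokens can be sampled at this step.
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_token_ids] = 0.0
    return logits + mask
```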
Then, as far as I understand, execution goes through:

```python
should_stop = self.dynamic_decoder.forward(
    next_token_logits, decode_step, max_context_length,
    self.max_attention_window_size, self.sink_token_length, ite, batch_size,
    self.end_ids, self.embedding_bias_opt, context_lengths,
    sequence_limit_lengths, stop_words_list_ptrs, stop_words_lens,
    max_stop_words_len, bad_words_list_ptrs, bad_words_lens,
    max_bad_words_len, no_repeat_ngram_size, this_src_cache_indirection,
    self.output_ids, self.new_tokens, self.finished, self.finished,
    self.sequence_length_buffer, self.cum_log_probs, self.log_probs,
    self.log_probs_tiled, self.parent_ids, this_tgt_cache_indirection,
    self.beam_hyps_output_ids_tgt, self.beam_hyps_sequence_lengths_tgt,
    self.beam_hyps_cum_log_probs, self.beam_hyps_normed_scores,
    self.beam_hyps_log_probs, self.beam_hyps_min_normed_scores,
    self.beam_hyps_num_beams, self.beam_hyps_is_done, scfg.use_beam_hyps)
```

and `should_stop` always becomes `tensor([True])`. I suspect something goes wrong in the subsequent kernel invocations when `-inf` values are present in the logits.
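A mitigation I'm considering (an untested sketch, assuming the non-finite values are what trips the kernels, which is not confirmed) is clamping `-inf` to the dtype's finite minimum before the decoder call:

```python
import torch

def replace_neg_inf(logits: torch.Tensor) -> torch.Tensor:
    # Swap -inf for the most negative finite value of the dtype; the
    # masked tokens stay effectively unsampleable, but the decoder
    # kernels never see non-finite inputs.
    finite_min = torch.finfo(logits.dtype).min
    return torch.nan_to_num(logits, neginf=finite_min)
```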
System Info

- CPU: x86_64
- GPU: L4
- RAM: 48 GB
- OS: Debian 11
- Python: 3.10
- TensorRT-LLM version: 0.9.0.dev2024022700
- TensorRT: 9.2.0.post12.dev5
- CUDA Version: 12.2
- NVIDIA Driver Version: 535.86.10
Who can help?
@byshiue
Reproduction
Running the multi-modal example (`run.py`) for LLaVA but editing the generate call to pass a `logits_processor` produces weird behaviour where the same token is generated over and over again. Specifically, I edited the generate call in `run.py` (source) to also take a logits processor, roughly as sketched below.
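The edit was along these lines (a hypothetical sketch; the actual `generate` call in `examples/multimodal/run.py` takes more arguments, and `my_logits_processor` is an illustrative name):

```python
# Before: the stock generate call from the example
output_ids = model.generate(input_ids, sampling_config=sampling_config)

# After: the same call, but with a logits processor attached
output_ids = model.generate(
    input_ids,
    sampling_config=sampling_config,
    logits_processor=my_logits_processor,
)
```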
Here, `logits_processor` is a function that modifies the logits tensor. But it seems like any operation that touches `logits` breaks every further generation step. I went so far as to define `logits_processor` as a no-op (sketched below), and the generated output tokens were still all the same.
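The no-op looked roughly like this (a sketch; I'm assuming an HF-style `(input_ids, logits)` signature, and the multiply is just a trivial operation that touches the tensor):

```python
import torch

def logits_processor(input_ids: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    # Mathematically a no-op, yet merely touching the tensor like this
    # is enough to make generation repeat the same token.
    return logits * 1.0
```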
Interestingly, while debugging I edited `tensorrt_llm/runtime/generation.py` (source) to remove the `self.buffer['logits'] = logits` assignment, and the problem went away.

Expected behavior
I want to apply the logits processor feature to enable things like JSON decoding.
actual behavior
Adding a logits processor that edits `logits` in the generation step in any way causes the model to output the same token over and over again.

additional notes
What are the consequences of removing the `self.buffer['logits'] = logits` line in the generation script? Is this a bug?
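My working guess (an assumption on my part, not confirmed from the TensorRT-LLM source) is an aliasing hazard: if the tensor handed to the logits processor shares storage with the stored buffer, an in-place edit made by the processor leaks into the next decoding step. A toy illustration of that hazard:

```python
import torch

buffer = {}
logits = torch.randn(1, 1, 8)

# Storing the tensor itself (not a copy) makes buffer['logits'] an alias.
buffer['logits'] = logits

# A processor that edits the tensor in place also mutates the buffer.
logits[..., :4] = float("-inf")
assert torch.equal(buffer['logits'], logits)  # same storage, same values

# Storing a defensive copy would decouple the two.
buffer['logits'] = logits.clone()
```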
Thank you for the help!