Open taozhang9527 opened 9 months ago
`summarize.py` has been moved to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples.
@taozhang9527 Did you test llama with run.py, or did you launch triton_server with trt-llm-llama? Something went wrong both when I used run.py to test the trt-llm-llama engine and when I launched triton_server with the same engine.
@taozhang9527 Is the issue resolved for you?
I did not get the error for a single input, but I get a similar error when I try a batch size > 1. I am following the code from https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/summarize_long.py. The tensorrt-llm version I am using is 0.9.0.dev2024020600.
[02/27/2024-10:56:49] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[02/27/2024-10:56:49] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[02/27/2024-10:56:49] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
/tmp/ipykernel_10529/1659061603.py:17: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
torch.tensor(t, dtype=torch.int32).cuda() for t in line_encoded
/home/azureuser/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:933: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
torch.nested.nested_tensor(split_ids_list,
RuntimeError Traceback (most recent call last)
Cell In[10], line 1
----> 1 main(args, tensorrt_llm_llama, [input_text]*2)
2 # main(args, tensorrt_llm_llama, input_text)
Cell In[4], line 20, in main(args, tensorrt_llm_llama, input_text_list)
17 end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
19 start = time.time()
---> 20 summary, _ = summarize_tensorrt_llm(input_text_list, tokenizer, tensorrt_llm_llama, args)
21 print(f"time taken to infer with tensorrt llm is {time.time()-start}")
22 if runtime_rank == 0:
Cell In[5], line 62, in summarize_tensorrt_llm(data_batch, tokenizer, tensorrt_llm_llama, args)
60 print("shape of line_encoded: ", line_encoded[0].shape, line_encoded[1].shape)
61 if tensorrt_llm_llama.remove_input_padding:
---> 62 output_ids = tensorrt_llm_llama.decode_batch(
63 line_encoded, sampling_config)
64 else:
65 output_ids = tensorrt_llm_llama.decode(
66 line_encoded,
67 input_lengths,
68 sampling_config,
69 )
File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2735, in GenerationSession.decode_batch(self, input_ids, sampling_config, streaming, **kwargs)
2729 def decode_batch(self,
2730 input_ids: Sequence[torch.Tensor],
2731 sampling_config: SamplingConfig,
2732 streaming: bool = False,
2733 **kwargs):
2734 input_ids, context_lengths = _prepare_input_ids(input_ids)
-> 2735 return self.decode(input_ids,
2736 context_lengths,
2737 sampling_config,
2738 streaming=streaming,
2739 **kwargs)
File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:757, in GenerationSession.cuda_stream_guard.<locals>.wrapper(self, *args, **kwargs)
755 external_stream.synchronize()
756 torch.cuda.set_stream(self.stream)
--> 757 ret = func(self, *args, **kwargs)
758 if external_stream != self.stream:
759 self.stream.synchronize()
File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2893, in GenerationSession.decode(self, input_ids, context_lengths, sampling_config, prompt_embedding_table, tasks, prompt_vocab_size, stop_words_list, bad_words_list, no_repeat_ngram_size, streaming, output_sequence_lengths, return_dict, encoder_output, encoder_input_lengths, stopping_criteria, logits_processor, cross_attention_mask, **kwargs)
2883 return self.decode_stream(
2884 batch_size, scfg, sequence_lengths, context_lengths,
2885 host_context_lengths, max_context_length, beam_width,
(...)
2890 encoder_output, encoder_input_lengths, stopping_criteria,
2891 logits_processor, cross_attention_mask, **kwargs)
2892 else:
-> 2893 return self.decode_regular(
2894 batch_size, scfg, sequence_lengths, context_lengths,
2895 host_context_lengths, max_context_length, beam_width,
2896 cache_indirections, input_ids, hidden_states,
2897 prompt_embedding_table, tasks, prompt_vocab_size, ite,
2898 sequence_limit_lengths, stop_words_list, bad_words_list,
2899 no_repeat_ngram_size, output_sequence_lengths, return_dict,
2900 encoder_output, encoder_input_lengths, stopping_criteria,
2901 logits_processor, cross_attention_mask, **kwargs)
File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2550, in GenerationSession.decode_regular(self, batch_size, scfg, sequence_lengths, context_lengths, host_context_lengths, max_context_length, beam_width, cache_indirections, input_ids, hidden_states, prompt_embedding_table, tasks, prompt_vocab_size, ite, sequence_limit_lengths, stop_words_list, bad_words_list, no_repeat_ngram_size, output_sequence_lengths, return_dict, encoder_output, encoder_input_lengths, stopping_criteria, logits_processor, cross_attention_mask, **kwargs)
2547 last_token_ids = torch.cumsum(context_lengths.clone().detach(),
2548 dim=0).int()
2549 for step in range(0, self.max_new_tokens):
-> 2550 should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
2551 cache_indirections, step, batch_size, max_context_length,
2552 beam_width, input_ids, hidden_states, scfg,
2553 kv_cache_block_pointers, host_kv_cache_block_pointers,
2554 prompt_embedding_table, tasks, context_lengths,
2555 host_context_lengths, attention_mask, cross_attention_mask,
2556 prompt_vocab_size, ite, sequence_limit_lengths,
2557 sequence_lengths, next_step_tensors, stop_words_list,
2558 bad_words_list, no_repeat_ngram_size, encoder_output,
2559 encoder_input_lengths, stopping_criteria, logits_processor,
2560 **kwargs)
2561 if step == 0:
2562 if benchmark_profiler is not None:
File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2233, in GenerationSession.handle_per_step(self, cache_indirections, step, batch_size, max_context_length, beam_width, input_ids, hidden_states, scfg, kv_cache_block_pointers, host_kv_cache_block_pointers, prompt_embedding_table, tasks, context_lengths, host_context_lengths, attention_mask, cross_attention_mask, prompt_vocab_size, ite, sequence_limit_lengths, sequence_lengths, next_step_tensors, stop_words_list, bad_words_list, no_repeat_ngram_size, encoder_output, encoder_input_lengths, stopping_criteria, logits_processor, **kwargs)
2230 ok = self.runtime._run(context, stream)
2232 if not ok:
-> 2233 raise RuntimeError(f"Executing TRT engine failed step={step}!")
2234 if self.debug_mode:
2235 torch.cuda.synchronize()
RuntimeError: Executing TRT engine failed step=0!
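From the error message ("Runtime dimension does not satisfy any optimization profile"), it looks like your request does not satisfy the constraints you set when building the engine. Since a single input works and a batch size > 1 fails, the batch size limit is a likely candidate. Please check your build parameters again.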
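One way to check this before calling `decode_batch` is to compare the batch against the limits used at build time. The sketch below is plain Python and does not call any TensorRT-LLM API. The function name `find_profile_violations` is hypothetical, and `max_batch_size` and `max_input_len` are assumed names standing in for whatever limits you set when building the engine. An engine may enforce other limits as well (for example on the total number of tokens), so treat this as a first sanity check only.

```python
from typing import List, Sequence


def find_profile_violations(
    input_lengths: Sequence[int],
    max_batch_size: int,
    max_input_len: int,
) -> List[str]:
    """Return human-readable reasons why a batch exceeds the build-time limits.

    input_lengths: number of tokens in each request of the batch.
    max_batch_size, max_input_len: assumed to be the values used at build time.
    """
    problems = []
    # The batch dimension must fit within the range the engine was built for.
    if len(input_lengths) > max_batch_size:
        problems.append(
            f"batch size {len(input_lengths)} exceeds max_batch_size {max_batch_size}"
        )
    # Every individual prompt must fit as well.
    for i, n in enumerate(input_lengths):
        if n > max_input_len:
            problems.append(
                f"input {i} has {n} tokens, exceeding max_input_len {max_input_len}"
            )
    return problems


# Example: two prompts sent to an engine assumed to be built for one sequence.
for problem in find_profile_violations([900, 900], max_batch_size=1, max_input_len=1024):
    print(problem)
```

If the list comes back non-empty, either send smaller batches or rebuild the engine with larger limits.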
In the 0.5 release, `summarize.py` is used for the summarization benchmark. However, in the latest 0.6.1 release, `summarize.py` does not exist; I can only find `summarize_long.py`. Following the instructions but replacing `summarize.py` with `summarize_long.py`, I got the following errors: