NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Summarization Benchmark Script Not Working for 0.6.1 Release #635

Open taozhang9527 opened 9 months ago

taozhang9527 commented 9 months ago

In the 0.5 release, summarize.py was used for the summarization benchmark. However, in the latest 0.6.1 release, summarize.py no longer exists; I can only find summarize_long.py.

Following the instructions but substituting summarize_long.py for summarize.py, I got the following errors:

[12/12/2023-04:17:51] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 35682 (MiB)
[12/12/2023-04:17:51] [TRT-LLM] [I] Load engine takes: 22.338767528533936 sec
/code/tensorrt_llm/examples/llama/summarize_long.py:209: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  torch.tensor(t, dtype=torch.int32).cuda() for t in line_encoded
[12/12/2023-04:17:53] [TRT-LLM] [I] Generation session set up with the parameters: batch_size: 1, max_context_length: 6483, max_new_tokens: 128, beam_width: 1, max_attention_window_size: 4096
[12/12/2023-04:17:53] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[12/12/2023-04:17:53] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[12/12/2023-04:17:53] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[12/12/2023-04:17:53] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
...

Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/llama/summarize_long.py", line 379, in <module>
    main(args)
  File "/code/tensorrt_llm/examples/llama/summarize_long.py", line 309, in main
    summarize_tensorrt_llm(datapoints[ite], tokenizer,
  File "/code/tensorrt_llm/examples/llama/summarize_long.py", line 247, in summarize_tensorrt_llm
    output_ids = tensorrt_llm_llama.decode_batch(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2131, in decode_batch
    return self.decode(input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 697, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2247, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1980, in decode_regular
    should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1715, in handle_per_step
    raise RuntimeError('Executing TRT engine failed!')
RuntimeError: Executing TRT engine failed!
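
For context: the session banner above shows max_context_length: 6483, while the setInputShape errors say a runtime dimension does not satisfy any optimization profile, i.e. the shapes passed at runtime fall outside the min/opt/max ranges baked into the engine at build time. A minimal diagnostic sketch for printing those ranges, assuming the TensorRT 8.5+ Python API (the helper name and engine file name below are illustrative, not from the repo):

```python
import tensorrt as trt

# Hypothetical diagnostic helper: print the optimization-profile bounds of
# every input tensor so you can see which runtime dimension falls outside
# them. Assumes the TensorRT 8.5+ tensor-based Python API.
def print_profile_shapes(engine_path: str) -> None:
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        if engine.get_tensor_mode(name) != trt.TensorIOMode.INPUT:
            continue
        for p in range(engine.num_optimization_profiles):
            # Returns [min, opt, max] dims for this input under profile p.
            min_s, opt_s, max_s = engine.get_tensor_profile_shape(name, p)
            print(f"{name} (profile {p}): min={min_s} opt={opt_s} max={max_s}")

print_profile_shapes("llama_float16_tp1_rank0.engine")  # example file name
```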
byshiue commented 9 months ago

summarize.py has been moved to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples.

THU-mjx commented 9 months ago

@taozhang9527 Did you test llama with run.py, or launch triton_server with trt-llm-llama? Something went wrong when I used run.py to test the trt-llm-llama engine, and also when I launched triton_server with the same engine.

vishnula-kore commented 6 months ago

@taozhang9527 Is the issue resolved for you?

I did not get the error for a single input, but I am getting a similar error when I try a batch size > 1. I am following the code from https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/summarize_long.py. The tensorrt-llm version I am using is 0.9.0.dev2024020600.


[02/27/2024-10:56:49] [TRT] [E] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[02/27/2024-10:56:49] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[02/27/2024-10:56:49] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
/tmp/ipykernel_10529/1659061603.py:17: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  torch.tensor(t, dtype=torch.int32).cuda() for t in line_encoded
/home/azureuser/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:933: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  torch.nested.nested_tensor(split_ids_list,
RuntimeError                              Traceback (most recent call last)
Cell In[10], line 1
----> 1 main(args, tensorrt_llm_llama, [input_text]*2)
      2 # main(args, tensorrt_llm_llama, input_text)

Cell In[4], line 20, in main(args, tensorrt_llm_llama, input_text_list)
     17 end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
     19 start = time.time()
---> 20 summary, _ = summarize_tensorrt_llm(input_text_list, tokenizer, tensorrt_llm_llama, args)
     21 print(f"time taken to infer with tensorrt llm is {time.time()-start}")
     22 if runtime_rank == 0:

Cell In[5], line 62, in summarize_tensorrt_llm(data_batch, tokenizer, tensorrt_llm_llama, args)
     60 print("shape of line_encoded: ", line_encoded[0].shape, line_encoded[1].shape) 
     61 if tensorrt_llm_llama.remove_input_padding:
---> 62     output_ids = tensorrt_llm_llama.decode_batch(
     63         line_encoded, sampling_config)
     64 else:
     65     output_ids = tensorrt_llm_llama.decode(
     66         line_encoded,
     67         input_lengths,
     68         sampling_config,
     69     )

File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2735, in GenerationSession.decode_batch(self, input_ids, sampling_config, streaming, **kwargs)
   2729 def decode_batch(self,
   2730                  input_ids: Sequence[torch.Tensor],
   2731                  sampling_config: SamplingConfig,
   2732                  streaming: bool = False,
   2733                  **kwargs):
   2734     input_ids, context_lengths = _prepare_input_ids(input_ids)
-> 2735     return self.decode(input_ids,
   2736                        context_lengths,
   2737                        sampling_config,
   2738                        streaming=streaming,
   2739                        **kwargs)

File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:757, in GenerationSession.cuda_stream_guard.<locals>.wrapper(self, *args, **kwargs)
    755     external_stream.synchronize()
    756     torch.cuda.set_stream(self.stream)
--> 757 ret = func(self, *args, **kwargs)
    758 if external_stream != self.stream:
    759     self.stream.synchronize()

File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2893, in GenerationSession.decode(self, input_ids, context_lengths, sampling_config, prompt_embedding_table, tasks, prompt_vocab_size, stop_words_list, bad_words_list, no_repeat_ngram_size, streaming, output_sequence_lengths, return_dict, encoder_output, encoder_input_lengths, stopping_criteria, logits_processor, cross_attention_mask, **kwargs)
   2883     return self.decode_stream(
   2884         batch_size, scfg, sequence_lengths, context_lengths,
   2885         host_context_lengths, max_context_length, beam_width,
   (...)
   2890         encoder_output, encoder_input_lengths, stopping_criteria,
   2891         logits_processor, cross_attention_mask, **kwargs)
   2892 else:
-> 2893     return self.decode_regular(
   2894         batch_size, scfg, sequence_lengths, context_lengths,
   2895         host_context_lengths, max_context_length, beam_width,
   2896         cache_indirections, input_ids, hidden_states,
   2897         prompt_embedding_table, tasks, prompt_vocab_size, ite,
   2898         sequence_limit_lengths, stop_words_list, bad_words_list,
   2899         no_repeat_ngram_size, output_sequence_lengths, return_dict,
   2900         encoder_output, encoder_input_lengths, stopping_criteria,
   2901         logits_processor, cross_attention_mask, **kwargs)

File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2550, in GenerationSession.decode_regular(self, batch_size, scfg, sequence_lengths, context_lengths, host_context_lengths, max_context_length, beam_width, cache_indirections, input_ids, hidden_states, prompt_embedding_table, tasks, prompt_vocab_size, ite, sequence_limit_lengths, stop_words_list, bad_words_list, no_repeat_ngram_size, output_sequence_lengths, return_dict, encoder_output, encoder_input_lengths, stopping_criteria, logits_processor, cross_attention_mask, **kwargs)
   2547 last_token_ids = torch.cumsum(context_lengths.clone().detach(),
   2548                               dim=0).int()
   2549 for step in range(0, self.max_new_tokens):
-> 2550     should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
   2551         cache_indirections, step, batch_size, max_context_length,
   2552         beam_width, input_ids, hidden_states, scfg,
   2553         kv_cache_block_pointers, host_kv_cache_block_pointers,
   2554         prompt_embedding_table, tasks, context_lengths,
   2555         host_context_lengths, attention_mask, cross_attention_mask,
   2556         prompt_vocab_size, ite, sequence_limit_lengths,
   2557         sequence_lengths, next_step_tensors, stop_words_list,
   2558         bad_words_list, no_repeat_ngram_size, encoder_output,
   2559         encoder_input_lengths, stopping_criteria, logits_processor,
   2560         **kwargs)
   2561     if step == 0:
   2562         if benchmark_profiler is not None:

File ~/.conda/envs/tensorrt-venv/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py:2233, in GenerationSession.handle_per_step(self, cache_indirections, step, batch_size, max_context_length, beam_width, input_ids, hidden_states, scfg, kv_cache_block_pointers, host_kv_cache_block_pointers, prompt_embedding_table, tasks, context_lengths, host_context_lengths, attention_mask, cross_attention_mask, prompt_vocab_size, ite, sequence_limit_lengths, sequence_lengths, next_step_tensors, stop_words_list, bad_words_list, no_repeat_ngram_size, encoder_output, encoder_input_lengths, stopping_criteria, logits_processor, **kwargs)
   2230     ok = self.runtime._run(context, stream)
   2232 if not ok:
-> 2233     raise RuntimeError(f"Executing TRT engine failed step={step}!")
   2234 if self.debug_mode:
   2235     torch.cuda.synchronize()

RuntimeError: Executing TRT engine failed step=0!
byshiue commented 6 months ago

@vishnula-kore From the error message, it looks like your request does not satisfy the constraints you set when building the engine. Please check your parameters again.
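
As a concrete pre-flight check, the limits chosen at build time are recorded next to the engine in config.json. A minimal sketch that validates a request against them before calling decode_batch, assuming the 0.6.x-era layout with a top-level "builder_config" section (field names may differ across releases):

```python
import json
import os

# Hypothetical pre-flight check: compare a request against the build-time
# limits stored in the engine directory's config.json. Assumes the
# 0.6.x-era layout ("builder_config" with max_batch_size / max_input_len /
# max_output_len); adjust the field names for your release.
def request_fits_engine(engine_dir, batch_size, input_len, max_new_tokens):
    with open(os.path.join(engine_dir, "config.json")) as f:
        cfg = json.load(f)["builder_config"]
    return (batch_size <= cfg["max_batch_size"]
            and input_len <= cfg["max_input_len"]
            and max_new_tokens <= cfg["max_output_len"])

# Example: a batch of 2 prompts fails this check if the engine was built
# with max_batch_size=1, matching the batch size > 1 failure above.
print(request_fits_engine("./engine_dir", batch_size=2,
                          input_len=6483, max_new_tokens=128))
```

If the check fails, rebuild the engine with limits large enough to cover the intended batch size and input length rather than changing the request.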