[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
Allocated 770.13 MiB for execution context memory.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
[06/13/2024-03:26:19] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[06/13/2024-03:26:19] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[06/13/2024-03:26:19] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2842] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2842, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 416, in main
benchmarker.run(inputs, config)
File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 254, in run
self.decoder.decode_batch(inputs[0],
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3240, in decode_batch
return self.decode(input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 947, in wrapper
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3463, in decode
return self.decode_regular(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3073, in decode_regular
should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2732, in handle_per_step
raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 515, in <module>
main(args)
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 441, in main
e.with_traceback())
TypeError: BaseException.with_traceback() takes exactly one argument (0 given)
[06/13/2024-03:26:25] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
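As a side note, the secondary `TypeError` in the log is a bug in the benchmark's error handling, not part of the engine failure: `benchmark.py` calls `e.with_traceback()` with no arguments, but `BaseException.with_traceback()` takes exactly one argument (a traceback object or `None`), so the real `RuntimeError` gets masked. A minimal sketch of the correct usage (the function name here is illustrative, not from the benchmark source):

```python
def reraise_with_traceback():
    """Demonstrate BaseException.with_traceback(): it takes exactly one
    argument (a traceback object or None) and returns the exception itself."""
    try:
        raise RuntimeError("Executing TRT engine failed step=0!")
    except RuntimeError as e:
        # Wrong:  e.with_traceback()
        #         -> TypeError: takes exactly one argument (0 given)
        # Right:  pass the traceback explicitly (or None to clear it).
        return e.with_traceback(e.__traceback__)


err = reraise_with_traceback()
print(type(err).__name__, err)  # RuntimeError Executing TRT engine failed step=0!
```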
System Info
Who can help?
@hijkzzz
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
python3 benchmarks/python/benchmark.py --engine_dir /trt_engines/chatglm3_6b/float16/1-gpu/ --dtype float16 --batch_size 16 --input_output_len "1024,512"
Expected behavior

The benchmark passes and prints its output.
actual behavior

The benchmark fails with the errors shown in the log above: TensorRT rejects the runtime input shape (`Runtime dimension does not satisfy any optimization profile`) and the run aborts with `RuntimeError: Executing TRT engine failed step=0!`.
additional notes

If I run without `--input_output_len`, it should be ok.
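The primary error (`Runtime dimension does not satisfy any optimization profile`) means the runtime input shape implied by `--batch_size 16 --input_output_len "1024,512"` falls outside the [min, max] range of every optimization profile baked into the engine at build time. A minimal sketch of that shape check, with made-up profile bounds (the real bounds come from the engine build settings, e.g. its maximum batch size and input length; the numbers below are illustrative only):

```python
# Hypothetical sketch of TensorRT's profile-satisfaction check: a runtime
# shape is accepted only if every dimension lies within some profile's
# [min, max] range. The bounds below are invented for illustration.


def satisfies(shape, profile):
    """True if every runtime dimension falls inside the profile's range.

    `profile` is a list of per-dimension (min, opt, max) tuples.
    """
    return all(lo <= dim <= hi for dim, (lo, _opt, hi) in zip(shape, profile))


# Suppose the engine was built to accept at most batch 8 and input length 512:
profiles = [
    [(1, 4, 8), (1, 256, 512)],  # (batch_size, input_len) bounds
]

requested = (16, 1024)  # --batch_size 16 --input_output_len "1024,512"
ok = any(satisfies(requested, p) for p in profiles)
print(ok)  # False -> "Runtime dimension does not satisfy any optimization profile"
```

If that is the cause, rebuilding the engine with build-time limits large enough for the benchmarked shape (or benchmarking within the engine's built profile ranges) should avoid the error.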