[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
Allocated 770.13 MiB for execution context memory.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
[06/13/2024-03:26:19] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[06/13/2024-03:26:19] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[06/13/2024-03:26:19] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2842] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2842, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 416, in main
benchmarker.run(inputs, config)
File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 254, in run
self.decoder.decode_batch(inputs[0],
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3240, in decode_batch
return self.decode(input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 947, in wrapper
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3463, in decode
return self.decode_regular(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3073, in decode_regular
should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2732, in handle_per_step
raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 515, in <module>
main(args)
File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 441, in main
e.with_traceback())
TypeError: BaseException.with_traceback() takes exactly one argument (0 given)
[06/13/2024-03:26:25] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
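As a side note, the secondary `TypeError` in the log is a bug in the benchmark's error handling, not part of the engine failure: `benchmark.py` calls `e.with_traceback()` with no arguments, but `BaseException.with_traceback()` takes exactly one argument (a traceback object or `None`), so the real `RuntimeError` gets masked. A minimal sketch of the correct usage (the function name here is illustrative, not from the benchmark source):

```python
def reraise_with_traceback():
    """Demonstrate BaseException.with_traceback(): it takes exactly one
    argument (a traceback object or None) and returns the exception itself."""
    try:
        raise RuntimeError("Executing TRT engine failed step=0!")
    except RuntimeError as e:
        # Wrong:  e.with_traceback()
        #         -> TypeError: takes exactly one argument (0 given)
        # Right:  pass the traceback explicitly (or None to clear it).
        return e.with_traceback(e.__traceback__)


err = reraise_with_traceback()
print(type(err).__name__, err)  # RuntimeError Executing TRT engine failed step=0!
```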
System Info
Who can help?
@hijkzzz
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
python3 benchmarks/python/benchmark.py --engine_dir /trt_engines/chatglm3_6b/float16/1-gpu/ --dtype float16 --batch_size 16 --input_output_len "1024,512"
Expected behavior

The benchmark passes and prints its output.
actual behavior

The benchmark fails with the errors shown in the log above: TensorRT rejects the runtime input shape (`Runtime dimension does not satisfy any optimization profile`) and the run aborts with `RuntimeError: Executing TRT engine failed step=0!`.
additional notes

If I run without `--input_output_len`, it should be ok.
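The primary error (`Runtime dimension does not satisfy any optimization profile`) means the runtime input shape implied by `--batch_size 16 --input_output_len "1024,512"` falls outside the [min, max] range of every optimization profile baked into the engine at build time. A minimal sketch of that shape check, with made-up profile bounds (the real bounds come from the engine build settings, e.g. its maximum batch size and input length; the numbers below are illustrative only):

```python
# Hypothetical sketch of TensorRT's profile-satisfaction check: a runtime
# shape is accepted only if every dimension lies within some profile's
# [min, max] range. The bounds below are invented for illustration.


def satisfies(shape, profile):
    """True if every runtime dimension falls inside the profile's range.

    `profile` is a list of per-dimension (min, opt, max) tuples.
    """
    return all(lo <= dim <= hi for dim, (lo, _opt, hi) in zip(shape, profile))


# Suppose the engine was built to accept at most batch 8 and input length 512:
profiles = [
    [(1, 4, 8), (1, 256, 512)],  # (batch_size, input_len) bounds
]

requested = (16, 1024)  # --batch_size 16 --input_output_len "1024,512"
ok = any(satisfies(requested, p) for p in profiles)
print(ok)  # False -> "Runtime dimension does not satisfy any optimization profile"
```

If that is the cause, rebuilding the engine with build-time limits large enough for the benchmarked shape (or benchmarking within the engine's built profile ranges) should avoid the error.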