Fred-cell opened this issue 1 month ago
Could you try the latest main branch (commit id: f430a4b447ef4cba22698902d43eae0debf08594)? The limitation has been removed.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
# mpirun -n 4 --allow-run-as-root /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark --engine_dir ...
The log shows that the second iteration failed. @byshiue
@byshiue Hi shiue, is there any update on this issue?
@jershi425 I built with the latest main branch, and the issue still exists as before.
Hi @Fred-cell , we're investigating this issue.
@Fred-cell Could you share the new error log? The limitation has been removed from the latest code, so I wonder what new issue you are encountering.
Same as before; the next iteration encountered an error. The model converts successfully, and inference with the benchmark tool is as below. What does the ERROR info mean?
BS: 10, ISL/OSL: 1024,512
Benchmarking done. Iteration: 1, duration: 20.32 sec.
Latencies: [20317.32]
[BENCHMARK] batch_size 10 input_length 1024 output_length 512 latency(ms) 20317.32 tokensPerSec 252.00 generation_time(ms) 15111.57 generationTokensPerSec 338.81 gpu_peak_mem(gb) 25.25
Benchmarking done. Iteration: 1, duration: 20.32 sec.
Latencies: [20317.26]
[TensorRT-LLM][ERROR] std::future_error: No associated state
[TensorRT-LLM][ERROR] std::future_error: No associated state
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
Maybe I should reproduce it by pulling the main branch once again?
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
The log should contain all iterations instead of only the first iteration. So the program does not exit correctly, but this does not affect the performance numbers.
When I benchmark 2048,128 with batch size 8, there is an error as below:
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: The engine is built with remove_input_padding=True and max_num_tokens=8192, while trying to benchmark on 16384 tokens (/src/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark.cpp:140)
1 0x57a607d9e66e tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
The engine build script is as below: trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 8 --max_input_len 2048 --max_output_len 128 --use_custom_all_reduce disable
The benchmark script is as below: mpirun -n 4 --allow-run-as-root /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark --engine_dir ./examples/qwen/trtModel/qwen1.5-32b/fp16 --warm_up 1 --batch_size 8 --duration 0 --num_runs 3 --input_output_len "2048,128"
Please increase max_num_tokens when building the engine, or enable chunked context.
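For context, the assertion fires because with remove_input_padding=True the benchmark packs the whole batch's context tokens into a single tensor, so the engine's max_num_tokens must cover batch_size * input_len. A back-of-the-envelope check (an assumption inferred from the assertion message above, not the benchmark's own code):

```python
# Minimum max_num_tokens for gptSessionBenchmark with remove_input_padding=True:
# the batch's context tokens are packed together, so at least
# batch_size * input_len tokens must fit in one forward pass.
def min_num_tokens(batch_size: int, input_len: int) -> int:
    return batch_size * input_len

# Values from the failing run above: batch 8, ISL 2048.
# 8 * 2048 = 16384 tokens, which exceeds the engine's max_num_tokens of 8192.
print(min_num_tokens(8, 2048))  # → 16384
```

This matches the "trying to benchmark on 16384 tokens" figure in the error log.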
When I set --max_num_tokens while building the engine, the error is as below:
# trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 8 --max_input_len 2048 --max_output_len 128 --use_custom_all_reduce disable --max_num_tokens 128
BS: 8, ISL/OSL: 1024,1
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: The engine is built with remove_input_padding=True and max_num_tokens=128, while trying to benchmark on 8192 tokens (/src/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark.cpp:140)
1 0x61fa3ec6566e tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2 0x61fa3ec6519b /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x1919b) [0x61fa3ec6519b]
3 0x75b7df30dd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x75b7df30dd90]
4 0x75b7df30de40 __libc_start_main + 128
5 0x61fa3ec6e7b5 /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x227b5) [0x61fa3ec6e7b5]
(The same assertion and backtrace are printed by the remaining three MPI ranks.)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[19922,1],0]
Exit code: 1
Running the benchmark with batch size > 1 fails; please help check.
[06/17/2024-00:51:04] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[06/17/2024-00:51:04] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[06/17/2024-00:51:04] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2842] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2842, condition: allInputDimensionsSpecified(routine))
Traceback (most recent call last):
  File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 416, in main
    benchmarker.run(inputs, config)
  File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 254, in run
    self.decoder.decode_batch(inputs[0],
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3240, in decode_batch
    return self.decode(input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 947, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3463, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3073, in decode_regular
    should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2732, in handle_per_step
    raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!
python3 examples/qwen/convert_checkpoint.py --model_dir Qwen1.5-32B/ --output_dir /ckpt/Qwen/32B/tllm_checkpoint_gpu_int4 --dtype float16 --use_weight_only --weight_only_precision int4 --tp_size 2 --load_model_on_cpu --qwen_type qwen2
trtllm-build --checkpoint_dir /ckpt/Qwen/32B/tllm_checkpoint_gpu_int4/ --output_dir /trt_engines/Qwen/32B/int4/2-gpu --gemm_plugin float16 --max_num_tokens 512 --max_batch_size 4 --tp_size 2 --use_custom_all_reduce disable --workers 2 --max_input_len 1024 --max_output_len 512
mpirun -n 2 --allow-run-as-root python3 benchmarks/python/benchmark.py -m qwen1.5_7b_chat --engine_dir /trt_engines/Qwen/32B/int4/2-gpu/ --batch_size 4 --input_output_len "2048,512"
@Fred-cell Please use a larger max_num_tokens, or simply delete it when building your engine. It has to be bigger than batch_size * (max_input_len + max_output_len), in your case 8192. @kevin-t-tang Same in your case: please delete max_num_tokens. Also, please make sure the input length during benchmarking is less than or equal to what you used during engine building; you specify 2048 there, which is bigger than the 1024 used during your engine build.
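The sizing rule stated above can be sketched as a quick check; the build values below are @kevin-t-tang's, and the formula is the one given in this comment rather than anything read from the TensorRT-LLM source:

```python
# Rule from the comment above: if max_num_tokens is set at build time,
# it should be at least batch_size * (max_input_len + max_output_len).
def required_max_num_tokens(batch_size: int, max_input_len: int,
                            max_output_len: int) -> int:
    return batch_size * (max_input_len + max_output_len)

# @kevin-t-tang's build: batch 4, input 1024, output 512 -> needs 6144,
# but the engine was built with --max_num_tokens 512.
print(required_max_num_tokens(4, 1024, 512))  # → 6144
```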
Reproduced it based on the v0.10.0 version, and the error still exists as above: trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 1 --max_input_len 8096 --max_output_len 512 --use_custom_all_reduce disable --max_num_tokens 8704
@Fred-cell What is the benchmark command? Let me try to reproduce it.
qwen# python convert_checkpoint.py --model_dir /code/tensorrt-llm/Qwen1.5-32B-Chat/ --output_dir ./trt_ckpt/qwen1.5-32b/fp16 --dtype float16 --tp_size 4
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024051400
0.11.0.dev2024051400
Loading checkpoint shards: 100%|██████████| 17/17 [00:12<00:00, 1.39it/s]
Traceback (most recent call last):
File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 368, in <module>
main()
File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 360, in main
convert_and_save_hf(args)
File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 322, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 328, in execute
f(args, rank)
File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 308, in convert_and_save_rank
qwen = from_hugging_face(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1086, in from_hugging_face
weights = load_weights_from_hf(config=config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1192, in load_weights_from_hf
weights = convert_hf_qwen(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 688, in convert_hf_qwen
assert mha_mode == True, "QWen uses MHA."
AssertionError: QWen uses MHA.
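The assertion trips because convert_hf_qwen at that version only handled plain multi-head attention, while Qwen1.5-32B reportedly uses grouped-query attention (num_key_value_heads smaller than num_attention_heads in its HF config). A quick way to check a checkpoint before converting; this is a sketch using the standard HF config.json field names, not the converter's own logic:

```python
import json

def uses_mha(config_path: str) -> bool:
    """True if the checkpoint uses plain multi-head attention,
    i.e. num_key_value_heads equals num_attention_heads."""
    with open(config_path) as f:
        cfg = json.load(f)
    n_heads = cfg["num_attention_heads"]
    # If num_key_value_heads is absent, the model is plain MHA.
    n_kv_heads = cfg.get("num_key_value_heads", n_heads)
    return n_kv_heads == n_heads
```

If this returns False for your checkpoint, the "QWen uses MHA" assert in convert.py will fail, and a TensorRT-LLM version with GQA support for Qwen is needed.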