NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

convert qwen1.5-32b-chat failed #1666

Fred-cell opened this issue 1 month ago

Fred-cell commented 1 month ago

qwen# python convert_checkpoint.py --model_dir /code/tensorrt-llm/Qwen1.5-32B-Chat/ --output_dir ./trt_ckpt/qwen1.5-32b/fp16 --dtype float16 --tp_size 4
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024051400
0.11.0.dev2024051400
Loading checkpoint shards: 100%|██████████| 17/17 [00:12<00:00, 1.39it/s]
Traceback (most recent call last):
  File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 368, in <module>
    main()
  File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 360, in main
    convert_and_save_hf(args)
  File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 322, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 328, in execute
    f(args, rank)
  File "/code/tensorrt-llm/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 308, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1086, in from_hugging_face
    weights = load_weights_from_hf(config=config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1192, in load_weights_from_hf
    weights = convert_hf_qwen(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 688, in convert_hf_qwen
    assert mha_mode == True, "QWen uses MHA."
AssertionError: QWen uses MHA.

byshiue commented 1 month ago

Could you try the latest main branch (commit id: f430a4b447ef4cba22698902d43eae0debf08594)? That limitation has been removed.
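
For reference, a rough sketch of one way to pick up that commit and retry the failing conversion. The clone location below is only illustrative, and the rebuild/reinstall step should follow the repo's installation docs rather than this comment:

# assumption: an existing source clone of TensorRT-LLM at this path
cd /code/tensorrt-llm/TensorRT-LLM
git fetch origin
git checkout f430a4b447ef4cba22698902d43eae0debf08594
# rebuild/reinstall the tensorrt_llm wheel per the repo's installation docs, then rerun:
cd examples/qwen
python convert_checkpoint.py --model_dir /code/tensorrt-llm/Qwen1.5-32B-Chat/ --output_dir ./trt_ckpt/qwen1.5-32b/fp16 --dtype float16 --tp_size 4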

Fred-cell commented 1 month ago

The model now converts successfully. Inference with the benchmark tool is shown below; what does the ERROR message mean?

BS: 10, ISL/OSL: 1024,512
Benchmarking done. Iteration: 1, duration: 20.32 sec. Latencies: [20317.32]
[BENCHMARK] batch_size 10 input_length 1024 output_length 512 latency(ms) 20317.32 tokensPerSec 252.00 generation_time(ms) 15111.57 generationTokensPerSec 338.81 gpu_peak_mem(gb) 25.25
Benchmarking done. Iteration: 1, duration: 20.32 sec. Latencies: [20317.26]
[TensorRT-LLM][ERROR] std::future_error: No associated state
[TensorRT-LLM][ERROR] std::future_error: No associated state

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[59484,1],1]
Exit code: 1

Fred-cell commented 1 month ago

]# mpirun -n 4 --allow-run-as-root /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark --engine_dir ...

The log shows that the second iteration failed. @byshiue

Fred-cell commented 1 month ago

@byshiue Hi, is there any update on this issue?

Fred-cell commented 1 month ago

@jershi425 I built with the latest main branch, and the issue still exists as before.

nv-guomingz commented 1 month ago

Hi @Fred-cell, we're investigating this issue.

byshiue commented 1 month ago

@Fred-cell Could you share the new error log? The limitation has been removed from the latest code, and I wonder what new issue you are encountering.

Fred-cell commented 1 month ago

Same as before; the next iteration encountered an error. The model converts successfully, and inference with the benchmark tool is shown below. What does the ERROR message mean?

BS: 10, ISL/OSL: 1024,512
Benchmarking done. Iteration: 1, duration: 20.32 sec. Latencies: [20317.32]
[BENCHMARK] batch_size 10 input_length 1024 output_length 512 latency(ms) 20317.32 tokensPerSec 252.00 generation_time(ms) 15111.57 generationTokensPerSec 338.81 gpu_peak_mem(gb) 25.25
Benchmarking done. Iteration: 1, duration: 20.32 sec. Latencies: [20317.26]
[TensorRT-LLM][ERROR] std::future_error: No associated state
[TensorRT-LLM][ERROR] std::future_error: No associated state
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

Fred-cell commented 1 month ago

Maybe I should try to reproduce it again after pulling the main branch once more?

Fred-cell commented 1 month ago

]# python convert_checkpoint.py --model_dir /code/tensorrt-llm/Qwen1.5-32B-Chat/ --output_dir ./trt_ckpt/qwen1.5-32b/fp16 --dtype float16 --tp_size 4 --qwen_type qwen2
]# trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 8 --max_input_len 1024 --max_output_len 512 --use_custom_all_reduce disable
]# mpirun -n 4 --allow-run-as-root /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark --engine_dir ./examples/qwen/trtModel/qwen1.5-32b/fp16 --warm_up 1 --batch_size 8 --duration 0 --num_runs 2 --input_output_len "1024,1"

[TensorRT-LLM][INFO] [MemUsage] GPU 6.25 GB, CPU 2.46 KB, Pinned 512.00 MB
[TensorRT-LLM][INFO] [MemUsage] GPU 6.25 GB, CPU 2.46 KB, Pinned 512.00 MB
[TensorRT-LLM][INFO] [MemUsage] GPU 6.25 GB, CPU 2.46 KB, Pinned 512.00 MB
[TensorRT-LLM][INFO] [MemUsage] GPU 6.25 GB, CPU 2.46 KB, Pinned 512.00 MB
Benchmarking done. Iteration: 1, duration: 4.12 sec. Latencies: [4119.79]
[TensorRT-LLM][INFO] [MemUsage] GPU 6.25 GB, CPU 2.46 KB, Pinned 512.00 MB
Benchmarking done. Iteration: 1, duration: 4.13 sec. Latencies: [4132.46]
[BENCHMARK] batch_size 8 input_length 1024 output_length 1 latency(ms) 4132.46 tokensPerSec 1.94 generation_time(ms) 0.10 generationTokensPerSec 77351.48 gpu_peak_mem(gb) 25.25
[TensorRT-LLM][INFO] [MemUsage] GPU 6.25 GB, CPU 2.46 KB, Pinned 512.00 MB
[TensorRT-LLM][ERROR] std::future_error: No associated state
[TensorRT-LLM][ERROR] std::future_error: No associated state

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[11039,1],2]
Exit code: 1

byshiue commented 1 month ago

The log should contain all iterations instead of only the first one. So, the program does not exit correctly, but that does not affect the performance numbers.

Fred-cell commented 1 month ago

When I benchmark 2048,128 with batch size 8, there is an error as below (the same assertion is raised on all four ranks; only the first stack trace is shown here):

[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: The engine is built with remove_input_padding=True and max_num_tokens=8192, while trying to benchmark on 16384 tokens (/src/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark.cpp:140)
1       0x57a607d9e66e tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x57a607d9e19b /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x1919b) [0x57a607d9e19b]
3       0x73887c8c2d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x73887c8c2d90]
4       0x73887c8c2e40 __libc_start_main + 128
5       0x57a607da77b5 /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x227b5) [0x57a607da77b5]

The engine build command is as below:

trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 8 --max_input_len 2048 --max_output_len 128 --use_custom_all_reduce disable

The benchmark command is as below:

mpirun -n 4 --allow-run-as-root /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark --engine_dir ./examples/qwen/trtModel/qwen1.5-32b/fp16 --warm_up 1 --batch_size 8 --duration 0 --num_runs 3 --input_output_len "2048,128"

byshiue commented 1 month ago

Please increase max_num_tokens when building the engine, or enable chunked context.
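
For example, a minimal sketch of the first option, reusing the build command from this thread. The only change is the added --max_num_tokens value, chosen here as an assumption so that it covers the 8 × 2048 = 16384 context tokens the benchmark submits per batch; enabling chunked context is the alternative and is not shown:

trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ \
    --output_dir trtModel/qwen1.5-32b/fp16 \
    --gemm_plugin float16 \
    --tp_size 4 \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_output_len 128 \
    --use_custom_all_reduce disable \
    --max_num_tokens 16384  # assumed value: >= batch_size * input_length used by gptSessionBenchmark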

Fred-cell commented 1 month ago

When I increased max_num_tokens during engine building, the error is as below:

]# trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 8 --max_input_len 2048 --max_output_len 128 --use_custom_all_reduce disable --max_num_tokens 128

BS: 8, ISL/OSL: 1024,1

[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: The engine is built with remove_input_padding=True and max_num_tokens=128, while trying to benchmark on 8192 tokens (/src/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark.cpp:140)
1       0x61fa3ec6566e tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x61fa3ec6519b /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x1919b) [0x61fa3ec6519b]
3       0x75b7df30dd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x75b7df30dd90]
4       0x75b7df30de40 __libc_start_main + 128
5       0x61fa3ec6e7b5 /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x227b5) [0x61fa3ec6e7b5]
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: The engine is built with remove_input_padding=True and max_num_tokens=128, while trying to benchmark on 8192 tokens (/src/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark.cpp:140)
1       0x5ce77990166e tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x5ce77990119b /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x1919b) [0x5ce77990119b]
3       0x7521d92cad90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7521d92cad90]
4       0x7521d92cae40 __libc_start_main + 128
5       0x5ce77990a7b5 /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x227b5) [0x5ce77990a7b5]
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: The engine is built with remove_input_padding=True and max_num_tokens=128, while trying to benchmark on 8192 tokens (/src/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark.cpp:140)
1       0x589babce466e tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x589babce419b /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x1919b) [0x589babce419b]
3       0x7e711a830d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7e711a830d90]
4       0x7e711a830e40 __libc_start_main + 128
5       0x589babced7b5 /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x227b5) [0x589babced7b5]
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: The engine is built with remove_input_padding=True and max_num_tokens=128, while trying to benchmark on 8192 tokens (/src/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark.cpp:140)
1       0x5c72eef5466e tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x5c72eef5419b /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x1919b) [0x5c72eef5419b]
3       0x70393fc0ed90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x70393fc0ed90]
4       0x70393fc0ee40 __libc_start_main + 128
5       0x5c72eef5d7b5 /app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark(+0x227b5) [0x5c72eef5d7b5]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[19922,1],0]
  Exit code:    1

kevin-t-tang commented 1 month ago

Running the benchmark with a batch size > 1 fails; please help check.

[06/17/2024-00:51:04] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[06/17/2024-00:51:04] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
Traceback (most recent call last):
  File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 416, in main
    benchmarker.run(inputs, config)
  File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 254, in run
[06/17/2024-00:51:04] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2842] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2842, condition: allInputDimensionsSpecified(routine) )
    self.decoder.decode_batch(inputs[0],
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3240, in decode_batch
    return self.decode(input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 947, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3463, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 3073, in decode_regular
    should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2732, in handle_per_step
    raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!

Convert

python3 examples/qwen/convert_checkpoint.py --model_dir Qwen1.5-32B/ --output_dir /ckpt/Qwen/32B/tllm_checkpoint_gpu_int4 --dtype float16 \
    --use_weight_only --weight_only_precision int4 --tp_size 2 --load_model_on_cpu --qwen_type qwen2

Build

trtllm-build --checkpoint_dir /ckpt/Qwen/32B/tllm_checkpoint_gpu_int4/ --output_dir /trt_engines/Qwen/32B/int4/2-gpu --gemm_plugin float16 \
    --max_num_tokens 512 --max_batch_size 4 --tp_size 2 --use_custom_all_reduce disable --workers 2 --max_input_len 1024 --max_output_len 512

Benchmark

mpirun -n 2 --allow-run-as-root python3 benchmarks/python/benchmark.py -m qwen1.5_7b_chat --engine_dir /trt_engines/Qwen/32B/int4/2-gpu/ --batch_size 4 --input_output_len "2048,512"

jershi425 commented 3 weeks ago

@Fred-cell Please use a larger max_num_tokens, or simply DELETE IT when building your engine. It has to be larger than batch_size * (max_input_len + max_output_len), in your case 8192. @kevin-t-tang The same applies in your case; please DELETE max_num_tokens. Also, please make sure the input length used during benchmarking is less than or equal to the max_input_len used during engine building. You specified 2048 there, which is larger than the 1024 used during your engine build.
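
For reference, the numbers behind the assertions earlier in this thread, read as batch_size × input_length from the gptSessionBenchmark messages (this accounting is inferred from those messages, not from the benchmark source):

8 × 2048 = 16384 tokens  >  max_num_tokens = 8192   (the "16384 tokens" assertion)
8 × 1024 =  8192 tokens  >  max_num_tokens = 128    (the "8192 tokens" assertion)

So max_num_tokens must at least cover the context tokens submitted in one batch; batch_size * (max_input_len + max_output_len) is the safe bound quoted above.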

Fred-cell commented 1 week ago

Reproduced it based on the v0.10.0 version, and the error still exists as above:

trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 1 --max_input_len 8096 --max_output_len 512 --use_custom_all_reduce disable --max_num_tokens 8704

jershi425 commented 1 week ago

Reproduced it based on the v0.10.0 version, and the error still exists as above:

trtllm-build --checkpoint_dir trt_ckpt/qwen1.5-32b/fp16/ --output_dir trtModel/qwen1.5-32b/fp16 --gemm_plugin float16 --tp_size 4 --max_batch_size 1 --max_input_len 8096 --max_output_len 512 --use_custom_all_reduce disable --max_num_tokens 8704

@Fred-cell What is the benchmark command? Let me try to reproduce it.