System Info
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
I am running inside the Docker container.
The version of TensorRT-LLM is v0.7.1.
Who can help?
No response
Information
[X] The official example scripts
[x] My own modified scripts
Tasks
[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
First, I run examples/llama/hf_llama_convert.py as follows:
python3 hf_llama_convert.py -i ~/Mixtral-8x7B-Instruct-v0.1 -o ~/mixtral/int8_kv_cache --calibrate-kv-cache -t fp16
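Since the failure later turns out to be an element-count mismatch when loading these converted weights, a quick sanity check of the converted files can help. A minimal sketch, assuming the converter wrote flat .bin files under the 1-gpu directory and that the large weight tensors are stored as fp16 (per -t fp16); the path and dtype are assumptions, adjust them to your setup:

# Sanity-check the converted binary weights: print each file's element count,
# assuming fp16 storage (2 bytes per element). Path and dtype are assumptions.
import glob
import os

out_dir = os.path.expanduser("~/mixtral/int8_kv_cache/1-gpu")
for path in sorted(glob.glob(os.path.join(out_dir, "*.bin"))):
    n_elements = os.path.getsize(path) // 2  # fp16 => 2 bytes per element
    print(f"{os.path.basename(path)}: {n_elements} elements")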
Then I run the examples/llama/build.py as follows:
python build.py --bin_model_dir=/app/tensorrt_llm/examples/mixtral/int8_kv_cache/1-gpu \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--remove_input_padding \
--use_gemm_plugin float16 \
--output_dir /app/tensorrt_llm/examples/mixtral/int8_kv_cache_weight_only \
--int8_kv_cache \
--use_weight_only \
--parallel_build \
--world_size 2 \
--pp_size 2 \
--enable_context_fmha \
--multi_block_mode \
--max_input_len 32768 \
--max_output_len 16384
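For context on the parallel settings: with --world_size 2 and --pp_size 2 the model is split into two pipeline stages, one per GPU. A rough sketch of such an even layer split (my own illustration, assuming Mixtral-8x7B's 32 decoder layers; build.py's exact assignment may differ):

num_layers = 32  # Mixtral-8x7B-Instruct-v0.1 decoder layers
pp_size = 2      # --pp_size 2 in the build command above
per_stage = num_layers // pp_size
for rank in range(pp_size):
    start = rank * per_stage
    print(f"pipeline rank {rank}: layers {start}..{start + per_stage - 1}")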
The result is:
[01/30/2024-06:37:13] [TRT-LLM] [W] Set rms_norm_eps to 1e-06 directly.
[01/30/2024-06:37:13] [TRT-LLM] [W] Parallelly build TensorRT engines. Please make sure that all of the 2 GPUs are totally free.
[01/30/2024-06:37:24] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 141, GPU 421 (MiB)
[01/30/2024-06:37:24] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 147, GPU 421 (MiB)
[01/30/2024-06:37:26] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2250, GPU 771 (MiB)
[01/30/2024-06:37:26] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1974, GPU +350, now: CPU 2256, GPU 771 (MiB)
[01/30/2024-06:37:26] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/30/2024-06:37:26] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/30/2024-06:37:26] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.2701 (GiB) Device 43.4521 (GiB)
[01/30/2024-06:37:26] [TRT-LLM] [I] [MemUsage] Rank 1 Engine build starts - Allocated Memory: Host 2.2741 (GiB) Device 10.1143 (GiB)
[01/30/2024-06:37:26] [TRT-LLM] [I] Loading weights from binary...
[01/30/2024-06:37:26] [TRT-LLM] [I] Loading weights from binary...
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/llama/build.py", line 1047, in
mp.spawn(build, nprocs=args.world_size, args=(args, ))
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/app/tensorrt_llm/examples/llama/build.py", line 995, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/app/tensorrt_llm/examples/llama/build.py", line 873, in build_rank_engine
tensorrt_llm_llama = get_model_object(args,
File "/app/tensorrt_llm/examples/llama/build.py", line 795, in get_model_object
load_from_binary(tensorrt_llm_llama,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1015, in load_from_binary
t = fromfile(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 900, in fromfile
t = t.reshape(shape)
ValueError: cannot reshape array of size 25165824 into shape (4096,5120)
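For reference, the ValueError is a plain element-count mismatch: build.py expects a (4096, 5120) tensor, i.e. 20,971,520 elements, but the binary file holds 25,165,824 elements (which factors as 4096 x 6144; the factorization is my own arithmetic, not from the log). A minimal numpy sketch reproducing the same error:

import numpy as np

t = np.zeros(25165824, dtype=np.float16)  # element count read from the binary file
print(4096 * 5120)     # 20971520 -- the shape build.py expects
print(t.size)          # 25165824 -- what the file actually holds (4096 * 6144)
t.reshape(4096, 5120)  # ValueError: cannot reshape array of size 25165824 into shape (4096,5120)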
Hi @lwj2001, the latest code branch doesn't have this issue. Would you please give it a try now?
Do you still have any further issues or questions? If not, we'll close this soon.
Expected behavior
The build succeeds.
actual behavior
The build fails with the ValueError shown above.
additional notes
Please advise on how to resolve this error.