System Info
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
I am running inside the Docker container.
The version of TensorRT-LLM is v0.7.1.
Who can help?
No response
Information
[X] The official example scripts
[x] My own modified scripts
Tasks
[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
First, I run examples/llama/hf_llama_convert.py as follows:
python3 hf_llama_convert.py -i ~/Mixtral-8x7B-Instruct-v0.1 -o ~/mixtral/int8_kv_cache --calibrate-kv-cache -t fp16
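Since the failure later turns out to be an element-count mismatch when loading these converted weights, a quick sanity check of the converted files can help. A minimal sketch, assuming the converter wrote flat .bin files under the 1-gpu directory and that the large weight tensors are stored as fp16 (per -t fp16); the path and dtype are assumptions, adjust them to your setup:

# Sanity-check the converted binary weights: print each file's element count,
# assuming fp16 storage (2 bytes per element). Path and dtype are assumptions.
import glob
import os

out_dir = os.path.expanduser("~/mixtral/int8_kv_cache/1-gpu")
for path in sorted(glob.glob(os.path.join(out_dir, "*.bin"))):
    n_elements = os.path.getsize(path) // 2  # fp16 => 2 bytes per element
    print(f"{os.path.basename(path)}: {n_elements} elements")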
Then I run the examples/llama/build.py as follows:
python build.py --bin_model_dir=/app/tensorrt_llm/examples/mixtral/int8_kv_cache/1-gpu \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--remove_input_padding \
--use_gemm_plugin float16 \
--output_dir /app/tensorrt_llm/examples/mixtral/int8_kv_cache_weight_only \
--int8_kv_cache \
--use_weight_only \
--parallel_build \
--world_size 2 \
--pp_size 2 \
--enable_context_fmha \
--multi_block_mode \
--max_input_len 32768 \
--max_output_len 16384
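For context on the parallel settings: with --world_size 2 and --pp_size 2 the model is split into two pipeline stages, one per GPU. A rough sketch of such an even layer split (my own illustration, assuming Mixtral-8x7B's 32 decoder layers; build.py's exact assignment may differ):

num_layers = 32  # Mixtral-8x7B-Instruct-v0.1 decoder layers
pp_size = 2      # --pp_size 2 in the build command above
per_stage = num_layers // pp_size
for rank in range(pp_size):
    start = rank * per_stage
    print(f"pipeline rank {rank}: layers {start}..{start + per_stage - 1}")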
The result is:
[01/30/2024-06:37:13] [TRT-LLM] [W] Set rms_norm_eps to 1e-06 directly.
[01/30/2024-06:37:13] [TRT-LLM] [W] Parallelly build TensorRT engines. Please make sure that all of the 2 GPUs are totally free.
[01/30/2024-06:37:24] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 141, GPU 421 (MiB)
[01/30/2024-06:37:24] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 147, GPU 421 (MiB)
[01/30/2024-06:37:26] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2250, GPU 771 (MiB)
[01/30/2024-06:37:26] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1974, GPU +350, now: CPU 2256, GPU 771 (MiB)
[01/30/2024-06:37:26] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/30/2024-06:37:26] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/30/2024-06:37:26] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.2701 (GiB) Device 43.4521 (GiB)
[01/30/2024-06:37:26] [TRT-LLM] [I] [MemUsage] Rank 1 Engine build starts - Allocated Memory: Host 2.2741 (GiB) Device 10.1143 (GiB)
[01/30/2024-06:37:26] [TRT-LLM] [I] Loading weights from binary...
[01/30/2024-06:37:26] [TRT-LLM] [I] Loading weights from binary...
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/llama/build.py", line 1047, in
mp.spawn(build, nprocs=args.world_size, args=(args, ))
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/app/tensorrt_llm/examples/llama/build.py", line 995, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/app/tensorrt_llm/examples/llama/build.py", line 873, in build_rank_engine
tensorrt_llm_llama = get_model_object(args,
File "/app/tensorrt_llm/examples/llama/build.py", line 795, in get_model_object
load_from_binary(tensorrt_llm_llama,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1015, in load_from_binary
t = fromfile(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 900, in fromfile
t = t.reshape(shape)
ValueError: cannot reshape array of size 25165824 into shape (4096,5120)
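For reference, the ValueError is a plain element-count mismatch: build.py expects a (4096, 5120) tensor, i.e. 20,971,520 elements, but the binary file holds 25,165,824 elements (which factors as 4096 x 6144; the factorization is my own arithmetic, not from the log). A minimal numpy sketch reproducing the same error:

import numpy as np

t = np.zeros(25165824, dtype=np.float16)  # element count read from the binary file
print(4096 * 5120)     # 20971520 -- the shape build.py expects
print(t.size)          # 25165824 -- what the file actually holds (4096 * 6144)
t.reshape(4096, 5120)  # ValueError: cannot reshape array of size 25165824 into shape (4096,5120)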
Hi @lwj2001, the latest code branch doesn't have this issue. Would you please give it a try now?
Do you still have any further issues or questions? If not, we'll close this soon.
Expected behavior
The build succeeds.
actual behavior
The build fails with the ValueError shown above.
additional notes
Please advise on how to resolve this error.