Open mfournioux opened 7 months ago
@nv-guomingz Could you please take a look? Thanks
Hi @mfournioux, I can't reproduce this issue on our latest main branch. Would you please give it a try? If the issue still exists, please let us know.
I have the same issue.
(.venv) F:\pythonprograms\llmstreaming>trtllm-build --checkpoint_dir F:\pythonprograms\llmstreaming\tllm_checkpoint_1gpu_streamingllm --output_dir ./mistralengine_streaming --gemm_plugin float16
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[03/15/2024-16:14:22] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set gemm_plugin to float16.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set lookup_plugin to None.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set lora_plugin to None.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set context_fmha to True.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set paged_kv_cache to True.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set remove_input_padding to True.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set multi_block_mode to False.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set enable_xqa to True.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set tokens_per_block to 128.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[03/15/2024-16:14:22] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[03/15/2024-16:14:22] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
Traceback (most recent call last):
File "C:\Users\RayBe\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\RayBe\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "F:\pythonprograms\llmstreaming\.venv\Scripts\trtllm-build.exe\__main__.py", line 7, in <module>
We are following https://console.brev.dev/notebook/streamingllm-tensorrt-llm on Windows. Urgh, there seems to be a mismatch between the versions of what is installed.
I put two print statements in PretrainedModel.py; my thinking is that you don't support Mistral v2 and Mixtral!
(.venv) F:\pythonprograms\llmstreaming>trtllm-build --checkpoint_dir F:\pythonprograms\llmstreaming\tllm_checkpoint_1gpu_streamingllm --output_dir ./mistralengine_streaming --gemm_plugin float16
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[03/15/2024-16:38:54] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set gemm_plugin to float16.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set lookup_plugin to None.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set lora_plugin to None.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set context_fmha to True.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set paged_kv_cache to True.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set remove_input_padding to True.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set multi_block_mode to False.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set enable_xqa to True.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set tokens_per_block to 128.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[03/15/2024-16:38:54] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[03/15/2024-16:38:54] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
line 332 load PretrainedModel file we are passing == {'transformer.layers.1.attention.qkv.weight', 'transformer.layers.0.mlp.gate.weight', 'transformer.layers.3.attention.dense.weight', 'transformer.layers.7.input_layernorm.weight', 'transformer.layers.5.input_layernorm.weight', 'transformer.layers.4.attention.dense.weight', 'transformer.layers.5.mlp.fc.weight', 'transformer.layers.12.attention.dense.weight', 'transformer.layers.11.attention.qkv.weight', 'transformer.layers.1.mlp.proj.weight', 'transformer.layers.0.mlp.proj.weight', 'transformer.layers.12.mlp.proj.weight', 'transformer.layers.6.mlp.fc.weight', 'transformer.layers.3.input_layernorm.weight', 'transformer.layers.12.post_layernorm.weight', 'transformer.layers.3.mlp.fc.weight', 'transformer.layers.4.mlp.fc.weight', 'transformer.layers.9.attention.qkv.weight', 'transformer.layers.7.mlp.proj.weight', 'transformer.vocab_embedding.weight', 'transformer.layers.6.mlp.gate.weight', 'transformer.layers.1.mlp.fc.weight', 'transformer.layers.6.post_layernorm.weight', 'transformer.layers.8.mlp.fc.weight', 'transformer.layers.10.post_layernorm.weight', 'transformer.layers.0.attention.dense.weight', 'transformer.layers.3.mlp.gate.weight', 'transformer.layers.9.mlp.proj.weight', 'transformer.layers.10.mlp.gate.weight', 'transformer.layers.0.attention.qkv.weight', 'transformer.layers.12.mlp.fc.weight', 'transformer.layers.2.mlp.gate.weight', 'transformer.layers.1.mlp.gate.weight', 'transformer.layers.2.input_layernorm.weight', 'transformer.layers.9.mlp.fc.weight', 'transformer.layers.11.attention.dense.weight', 'transformer.layers.2.post_layernorm.weight', 'transformer.layers.11.mlp.proj.weight', 'lm_head.weight', 'transformer.layers.10.input_layernorm.weight', 'transformer.layers.1.attention.dense.weight', 'transformer.layers.7.mlp.gate.weight', 'transformer.layers.8.input_layernorm.weight', 'transformer.layers.6.attention.dense.weight', 'transformer.layers.11.mlp.gate.weight', 
'transformer.layers.12.attention.qkv.weight', 'transformer.layers.10.attention.qkv.weight', 'transformer.layers.8.post_layernorm.weight', 'transformer.layers.6.input_layernorm.weight', 'transformer.layers.4.mlp.gate.weight', 'transformer.layers.8.attention.qkv.weight', 'transformer.layers.12.input_layernorm.weight', 'transformer.layers.11.mlp.fc.weight', 'transformer.layers.9.post_layernorm.weight', 'transformer.layers.11.input_layernorm.weight', 'transformer.layers.5.post_layernorm.weight', 'transformer.layers.4.attention.qkv.weight', 'transformer.layers.9.input_layernorm.weight', 'transformer.layers.10.mlp.fc.weight', 'transformer.layers.3.post_layernorm.weight', 'transformer.layers.5.attention.dense.weight', 'transformer.layers.9.attention.dense.weight', 'transformer.layers.7.attention.dense.weight', 'transformer.layers.1.post_layernorm.weight', 'transformer.layers.1.input_layernorm.weight', 'transformer.layers.4.input_layernorm.weight', 'transformer.layers.7.attention.qkv.weight', 'transformer.layers.2.mlp.fc.weight', 'transformer.layers.6.mlp.proj.weight', 'transformer.layers.10.attention.dense.weight', 'transformer.layers.0.mlp.fc.weight', 'transformer.layers.2.mlp.proj.weight', 'transformer.layers.8.mlp.gate.weight', 'transformer.layers.4.post_layernorm.weight', 'transformer.layers.7.post_layernorm.weight', 'transformer.layers.0.input_layernorm.weight', 'transformer.layers.5.mlp.proj.weight', 'transformer.layers.8.mlp.proj.weight', 'transformer.layers.2.attention.qkv.weight', 'transformer.layers.6.attention.qkv.weight', 'transformer.layers.5.attention.qkv.weight', 'transformer.layers.0.post_layernorm.weight', 'transformer.layers.11.post_layernorm.weight', 'transformer.layers.4.mlp.proj.weight', 'transformer.layers.7.mlp.fc.weight', 'transformer.layers.3.attention.qkv.weight', 'transformer.layers.3.mlp.proj.weight', 'transformer.layers.9.mlp.gate.weight', 'transformer.layers.8.attention.dense.weight', 'transformer.layers.12.mlp.gate.weight', 
'transformer.layers.2.attention.dense.weight', 'transformer.layers.5.mlp.gate.weight', 'transformer.layers.10.mlp.proj.weight'}
line 333 load PretrainedModel what the engine is expecting == {'transformer.layers.1.attention.qkv.weight', 'transformer.layers.31.mlp.gate.weight', 'transformer.layers.13.mlp.fc.weight', 'transformer.layers.26.mlp.proj.weight', 'transformer.layers.15.mlp.fc.weight', 'transformer.layers.0.mlp.gate.weight', 'transformer.layers.28.post_layernorm.weight', 'transformer.layers.31.input_layernorm.weight', 'transformer.layers.3.attention.dense.weight', 'transformer.layers.7.input_layernorm.weight', 'transformer.layers.25.post_layernorm.weight', 'transformer.layers.5.input_layernorm.weight', 'transformer.layers.4.attention.dense.weight', 'transformer.layers.19.post_layernorm.weight', 'transformer.layers.21.attention.dense.weight', 'transformer.layers.16.post_layernorm.weight', 'transformer.layers.30.input_layernorm.weight', 'transformer.layers.5.mlp.fc.weight', 'transformer.layers.12.attention.dense.weight', 'transformer.layers.11.attention.qkv.weight', 'transformer.layers.18.post_layernorm.weight', 'transformer.layers.1.mlp.proj.weight', 'transformer.layers.31.attention.dense.weight', 'transformer.layers.21.attention.qkv.weight', 'transformer.layers.0.mlp.proj.weight', 'transformer.layers.28.attention.dense.weight', 'transformer.layers.24.mlp.fc.weight', 'transformer.layers.20.attention.dense.weight', 'transformer.layers.27.mlp.fc.weight', 'transformer.layers.13.mlp.proj.weight', 'transformer.layers.17.attention.dense.weight', 'transformer.layers.24.post_layernorm.weight', 'transformer.layers.12.mlp.proj.weight', 'transformer.layers.13.attention.dense.weight', 'transformer.layers.20.mlp.proj.weight', 'transformer.layers.30.post_layernorm.weight', 'transformer.layers.16.input_layernorm.weight', 'transformer.layers.29.attention.qkv.weight', 'transformer.layers.30.attention.dense.weight', 'transformer.layers.30.attention.qkv.weight', 'transformer.layers.6.mlp.fc.weight', 'transformer.layers.3.input_layernorm.weight', 'transformer.layers.3.mlp.fc.weight', 
'transformer.layers.12.post_layernorm.weight', 'transformer.layers.29.input_layernorm.weight', 'transformer.layers.22.input_layernorm.weight', 'transformer.layers.17.mlp.fc.weight', 'transformer.layers.4.mlp.fc.weight', 'transformer.layers.27.attention.dense.weight', 'transformer.layers.9.attention.qkv.weight', 'transformer.layers.15.attention.qkv.weight', 'transformer.vocab_embedding.weight', 'transformer.layers.7.mlp.proj.weight', 'transformer.layers.6.mlp.gate.weight', 'transformer.layers.1.mlp.fc.weight', 'transformer.layers.6.post_layernorm.weight', 'transformer.layers.23.attention.dense.weight', 'transformer.layers.8.mlp.fc.weight', 'transformer.layers.21.mlp.proj.weight', 'transformer.layers.21.input_layernorm.weight', 'transformer.layers.10.post_layernorm.weight', 'transformer.layers.26.attention.dense.weight', 'transformer.layers.0.attention.dense.weight', 'transformer.layers.31.attention.qkv.weight', 'transformer.layers.13.post_layernorm.weight', 'transformer.layers.23.input_layernorm.weight', 'transformer.layers.3.mlp.gate.weight', 'transformer.layers.27.mlp.proj.weight', 'transformer.layers.27.mlp.gate.weight', 'transformer.layers.9.mlp.proj.weight', 'transformer.layers.21.mlp.fc.weight', 'transformer.layers.18.mlp.fc.weight', 'transformer.layers.10.mlp.gate.weight', 'transformer.layers.0.attention.qkv.weight', 'transformer.layers.12.mlp.fc.weight', 'transformer.layers.29.mlp.gate.weight', 'transformer.layers.2.mlp.gate.weight', 'transformer.layers.24.attention.dense.weight', 'transformer.layers.22.attention.qkv.weight', 'transformer.layers.26.mlp.fc.weight', 'transformer.layers.18.mlp.proj.weight', 'transformer.ln_f.weight', 'transformer.layers.1.mlp.gate.weight', 'transformer.layers.2.input_layernorm.weight', 'transformer.layers.26.input_layernorm.weight', 'transformer.layers.14.mlp.proj.weight', 'transformer.layers.26.attention.qkv.weight', 'transformer.layers.24.attention.qkv.weight', 'transformer.layers.9.mlp.fc.weight', 
'transformer.layers.11.attention.dense.weight', 'transformer.layers.29.post_layernorm.weight', 'transformer.layers.2.post_layernorm.weight', 'transformer.layers.11.mlp.proj.weight', 'transformer.layers.28.input_layernorm.weight', 'transformer.layers.15.mlp.proj.weight', 'transformer.layers.21.mlp.gate.weight', 'transformer.layers.16.attention.dense.weight', 'transformer.layers.26.post_layernorm.weight', 'transformer.layers.10.input_layernorm.weight', 'transformer.layers.1.attention.dense.weight', 'transformer.layers.23.mlp.gate.weight', 'lm_head.weight', 'transformer.layers.13.input_layernorm.weight', 'transformer.layers.7.mlp.gate.weight', 'transformer.layers.22.attention.dense.weight', 'transformer.layers.8.input_layernorm.weight', 'transformer.layers.15.attention.dense.weight', 'transformer.layers.25.mlp.gate.weight', 'transformer.layers.6.attention.dense.weight', 'transformer.layers.11.mlp.gate.weight', 'transformer.layers.28.mlp.fc.weight', 'transformer.layers.12.attention.qkv.weight', 'transformer.layers.22.mlp.fc.weight', 'transformer.layers.18.input_layernorm.weight', 'transformer.layers.24.mlp.gate.weight', 'transformer.layers.10.attention.qkv.weight', 'transformer.layers.19.mlp.fc.weight', 'transformer.layers.8.post_layernorm.weight', 'transformer.layers.22.post_layernorm.weight', 'transformer.layers.23.post_layernorm.weight', 'transformer.layers.27.post_layernorm.weight', 'transformer.layers.25.mlp.fc.weight', 'transformer.layers.6.input_layernorm.weight', 'transformer.layers.16.mlp.fc.weight', 'transformer.layers.16.attention.qkv.weight', 'transformer.layers.25.input_layernorm.weight', 'transformer.layers.4.mlp.gate.weight', 'transformer.layers.29.mlp.fc.weight', 'transformer.layers.13.attention.qkv.weight', 'transformer.layers.8.attention.qkv.weight', 'transformer.layers.23.mlp.fc.weight', 'transformer.layers.12.input_layernorm.weight', 'transformer.layers.15.input_layernorm.weight', 'transformer.layers.19.mlp.gate.weight', 
'transformer.layers.24.mlp.proj.weight', 'transformer.layers.20.mlp.gate.weight', 'transformer.layers.11.mlp.fc.weight', 'transformer.layers.9.post_layernorm.weight', 'transformer.layers.14.mlp.fc.weight', 'transformer.layers.20.input_layernorm.weight', 'transformer.layers.25.mlp.proj.weight', 'transformer.layers.28.mlp.gate.weight', 'transformer.layers.13.mlp.gate.weight', 'transformer.layers.11.input_layernorm.weight', 'transformer.layers.5.post_layernorm.weight', 'transformer.layers.4.attention.qkv.weight', 'transformer.layers.30.mlp.gate.weight', 'transformer.layers.22.mlp.proj.weight', 'transformer.layers.22.mlp.gate.weight', 'transformer.layers.9.input_layernorm.weight', 'transformer.layers.3.post_layernorm.weight', 'transformer.layers.10.mlp.fc.weight', 'transformer.layers.23.attention.qkv.weight', 'transformer.layers.5.attention.dense.weight', 'transformer.layers.24.input_layernorm.weight', 'transformer.layers.14.mlp.gate.weight', 'transformer.layers.9.attention.dense.weight', 'transformer.layers.21.post_layernorm.weight', 'transformer.layers.7.attention.dense.weight', 'transformer.layers.14.input_layernorm.weight', 'transformer.layers.19.attention.qkv.weight', 'transformer.layers.1.post_layernorm.weight', 'transformer.layers.20.post_layernorm.weight', 'transformer.layers.17.attention.qkv.weight', 'transformer.layers.30.mlp.proj.weight', 'transformer.layers.1.input_layernorm.weight', 'transformer.layers.4.input_layernorm.weight', 'transformer.layers.7.attention.qkv.weight', 'transformer.layers.17.input_layernorm.weight', 'transformer.layers.2.mlp.fc.weight', 'transformer.layers.14.attention.qkv.weight', 'transformer.layers.18.attention.dense.weight', 'transformer.layers.6.mlp.proj.weight', 'transformer.layers.14.attention.dense.weight', 'transformer.layers.17.mlp.proj.weight', 'transformer.layers.19.input_layernorm.weight', 'transformer.layers.10.attention.dense.weight', 'transformer.layers.19.mlp.proj.weight', 'transformer.layers.0.mlp.fc.weight', 
'transformer.layers.2.mlp.proj.weight', 'transformer.layers.8.mlp.gate.weight', 'transformer.layers.19.attention.dense.weight', 'transformer.layers.14.post_layernorm.weight', 'transformer.layers.4.post_layernorm.weight', 'transformer.layers.15.mlp.gate.weight', 'transformer.layers.28.mlp.proj.weight', 'transformer.layers.7.post_layernorm.weight', 'transformer.layers.29.mlp.proj.weight', 'transformer.layers.0.input_layernorm.weight', 'transformer.layers.5.mlp.proj.weight', 'transformer.layers.17.mlp.gate.weight', 'transformer.layers.31.post_layernorm.weight', 'transformer.layers.8.mlp.proj.weight', 'transformer.layers.16.mlp.proj.weight', 'transformer.layers.2.attention.qkv.weight', 'transformer.layers.6.attention.qkv.weight', 'transformer.layers.28.attention.qkv.weight', 'transformer.layers.30.mlp.fc.weight', 'transformer.layers.5.attention.qkv.weight', 'transformer.layers.0.post_layernorm.weight', 'transformer.layers.20.mlp.fc.weight', 'transformer.layers.11.post_layernorm.weight', 'transformer.layers.25.attention.qkv.weight', 'transformer.layers.4.mlp.proj.weight', 'transformer.layers.7.mlp.fc.weight', 'transformer.layers.18.attention.qkv.weight', 'transformer.layers.29.attention.dense.weight', 'transformer.layers.27.attention.qkv.weight', 'transformer.layers.31.mlp.fc.weight', 'transformer.layers.3.attention.qkv.weight', 'transformer.layers.18.mlp.gate.weight', 'transformer.layers.3.mlp.proj.weight', 'transformer.layers.9.mlp.gate.weight', 'transformer.layers.8.attention.dense.weight', 'transformer.layers.15.post_layernorm.weight', 'transformer.layers.17.post_layernorm.weight', 'transformer.layers.31.mlp.proj.weight', 'transformer.layers.25.attention.dense.weight', 'transformer.layers.2.attention.dense.weight', 'transformer.layers.12.mlp.gate.weight', 'transformer.layers.27.input_layernorm.weight', 'transformer.layers.16.mlp.gate.weight', 'transformer.layers.26.mlp.gate.weight', 'transformer.layers.23.mlp.proj.weight', 
'transformer.layers.20.attention.qkv.weight', 'transformer.layers.5.mlp.gate.weight', 'transformer.layers.10.mlp.proj.weight'}
Traceback (most recent call last):
File "C:\Users\RayBe\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\RayBe\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "F:\pythonprograms\llmstreaming\.venv\Scripts\trtllm-build.exe\__main__.py", line 7, in <module>
(.venv) F:\pythonprograms\llmstreaming>
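The two dumps above differ in a regular way: the "passing" set only contains layers 0-12, while the engine expects layers 0-31 (plus transformer.ln_f.weight). Diffing the two sets makes that obvious at a glance. A minimal sketch, using small stand-ins built from the qkv names only (an assumption for illustration; the same diff works on the full dumps):

```python
# Rebuild stand-ins for the two sets printed at lines 332/333 of PretrainedModel.
# Here only the qkv weight names are used; the real sets contain all weights.
provided = {f"transformer.layers.{i}.attention.qkv.weight" for i in range(13)}
expected = {f"transformer.layers.{i}.attention.qkv.weight" for i in range(32)}

missing = sorted(expected - provided)  # wanted by the engine, absent from the checkpoint
extra = sorted(provided - expected)    # present in the checkpoint, not wanted

print(f"{len(missing)} missing, {len(extra)} extra")
print(missing[0])  # first missing tensor name
```

A gap of exactly layers 13-31 suggests the checkpoint was converted for a 13-layer model while the engine config describes a 32-layer one, i.e. the checkpoint and the build do not come from the same conversion.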
I have successfully converted a Mixtral 8x7B model with tensor parallelism using this script from the llama example folder:
python convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
    --output_dir ./tllm_checkpoint_mixtral_2gpu \
    --dtype float16 \
    --tp_size 2
Then, when I start building the engine with this command:
trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_2gpu \
    --output_dir ./trt_engines/mixtral/tp2 \
    --gemm_plugin float16
This error appears:
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 489, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 413, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 385, in build_and_save
engine = build(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 266, in build
model.load(weights)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 338, in load
raise RuntimeError(err_msg)
RuntimeError: Provided tensor names are different from those expected by the engine.
Do you have any suggestions on how to solve this issue, please?
Many thanks for your help.
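One quick check before rebuilding: confirm that the converted checkpoint's config actually describes the model you expect. A minimal sketch, assuming the usual TensorRT-LLM checkpoint layout (a config.json next to the rank*.safetensors files) and the 0.8.0 config field names; treat both as assumptions:

```python
import json
from pathlib import Path

def checkpoint_summary(ckpt_dir):
    """Read architecture and layer count from a converted checkpoint's config.json.

    Assumes the TensorRT-LLM 0.8.0 unified checkpoint format, where the
    converted directory contains a config.json with these keys.
    """
    cfg = json.loads((Path(ckpt_dir) / "config.json").read_text())
    return cfg.get("architecture"), cfg.get("num_hidden_layers")
```

For a Mixtral 8x7B conversion, e.g. checkpoint_summary("./tllm_checkpoint_mixtral_2gpu") should report 32 hidden layers; if it reports something else, the mismatch comes from the conversion step rather than from trtllm-build.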