ttim opened 8 months ago
@ttim you can use INT4 weights & FP8 activations with the following steps:
- build TensorRT-LLM with --cuda_architectures "90-real"
- set --qformat to w4a8_awq for quantize.py (the resulting checkpoint uses quant_algo W4A8_AWQ)
- along with FP8 KV cache, please add --kv_cache_dtype fp8 to the quantize.py command.
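Putting those flags together, a minimal end-to-end sketch might look like the following. The paths, model name, dtype, and calibration size are placeholders I am assuming for illustration, not an official recipe; the flags themselves are the ones referenced in this thread (plus scripts/build_wheel.py for the source build):

# Build TensorRT-LLM from source targeting Hopper (SM90), as suggested above
python scripts/build_wheel.py --cuda_architectures "90-real"

# Quantize a Llama-7B HF checkpoint to W4A8-AWQ with FP8 KV cache
python examples/quantization/quantize.py --model_dir ./llama-7b-hf \
                                         --dtype float16 \
                                         --qformat w4a8_awq \
                                         --kv_cache_dtype fp8 \
                                         --output_dir ./llama-7b-w4a8-awq \
                                         --calib_size 32

# Build the engine from the quantized checkpoint
trtllm-build --checkpoint_dir ./llama-7b-w4a8-awq \
             --output_dir ./llama-7b-w4a8-awq-engine \
             --gemm_plugin float16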
@Barry-Delaney, I got an error when building w4a8_awq engines of Llama-7B with the trtllm-build tool. My TRT-LLM version is v0.8.0. (By the way, the quantization stage works fine when using the quantize.py script.)
"RuntimeError: Provided tensor names are different from those expected by the engine"
Could you give me some advice? Is this a bug?
@felixslu This error appears when the generated checkpoints' names are not converted correctly. This is fixed in the latest main branch.
Hi @Barry-Delaney, is this issue fixed in v0.8.0?
I'm asking because I hit the same error: "RuntimeError: Provided tensor names are different from those expected by the engine."
The command is:
python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-6.7b-online-v2.1 \
--dtype bfloat16 \
--qformat w4a8_awq \
--tp_size 2 \
--awq_block_size 128 \
--kv_cache_dtype fp8 \
--output_dir /data/deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
--calib_size 32
trtllm-build --checkpoint_dir /data/deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
--output_dir /data/trt-engines-deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
--workers 2 \
--paged_kv_cache enable \
--gpt_attention_plugin bfloat16 \
--max_batch_size 64 \
--gemm_plugin bfloat16
The error is:
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[05/22/2024-12:58:33] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set lookup_plugin to None.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set lora_plugin to None.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set context_fmha to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set remove_input_padding to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set multi_block_mode to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set enable_xqa to True.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/22/2024-12:58:33] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/22/2024-12:58:33] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 392, in build_and_save
engine = build(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 272, in build
model.load(weights)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 338, in load
raise RuntimeError(err_msg)
RuntimeError: Provided tensor names are different from those expected by the engine.
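Side note on the warning earlier in this log: trtllm-build also accepts --max_num_tokens, so it can be set explicitly to match the expected packed-token workload instead of defaulting to max_batch_size*max_input_len. The same build command with that flag added would look like this (8192 is only an illustrative placeholder, not a tuned value):

trtllm-build --checkpoint_dir /data/deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
             --output_dir /data/trt-engines-deepseek-6.7b-online-v2.1-w4a8-awq-tp2 \
             --workers 2 \
             --paged_kv_cache enable \
             --gpt_attention_plugin bfloat16 \
             --max_batch_size 64 \
             --gemm_plugin bfloat16 \
             --max_num_tokens 8192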
@activezhao This is because Deepseek is not supported yet.
@Barry-Delaney OK, is there any plan to support it?
We are currently using fp8, and the performance is great. Now we want to try w4a8.
If W4A8 does not support Deepseek, how about INT4 AWQ (W4A16)?
Thanks.
@activezhao Thanks for the feedback! INT4-AWQ is also not supported, as the Deepseek model is not implemented yet. Please open a feature request for the Deepseek model in case you need the support.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
H100, Llama 70B
Who can help?
No response
Information
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Should be possible
actual behavior
Impossible
additional notes
Not a bug, but a feature request