Update: I thought I was wrong by setting the model type to llama instead of mixtral, but the only difference is that the library doesn't build the engine and directs you to use the TensorRT-LLM build. I also tried:
trtllm-build --checkpoint_dir /shared/mixtral-8x7b-awq-instruct-base-trt-tp4 \
--output_dir /shared/mixtral-8x7b-awq-instruct-engine-trt-tp4 \
--gemm_plugin float16
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[05/21/2024-19:55:47] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set gemm_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set lookup_plugin to None.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set lora_plugin to None.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set moe_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set context_fmha to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set remove_input_padding to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set multi_block_mode to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set enable_xqa to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set tokens_per_block to 64.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set multiple_profiles to False.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set paged_state to True.
[05/21/2024-19:55:47] [TRT-LLM] [I] Set streamingllm to False.
[05/21/2024-19:55:47] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/21/2024-19:55:47] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py:1013: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 496, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 377, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 336, in build_and_save
engine = build_model(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_model
model = load_model(rank_config, ckpt_dir, model_cls)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1150, in load_model
preprocess_weights(weights,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1013, in preprocess_weights
weights[name] = preprocessor(param.T.contiguous(),
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 1792 and num_col_bytes = 8. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
Same error. So we can't build the engine from the checkpoint produced by this library's quantization. Here is the generated config.json:
{
  "producer": {
    "name": "modelopt",
    "version": "0.11.2"
  },
  "architecture": "LlamaForCausalLM",
  "dtype": "float16",
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "hidden_size": 4096,
  "norm_epsilon": 1e-05,
  "vocab_size": 32000,
  "max_position_embeddings": 32768,
  "hidden_act": "swiglu",
  "use_parallel_embedding": true,
  "embedding_sharding_dim": 0,
  "quantization": {
    "quant_algo": "W4A16_AWQ",
    "kv_cache_quant_algo": "FP8",
    "group_size": 128,
    "has_zero_point": false,
    "pre_quant_scale": true,
    "exclude_modules": [
      "lm_head"
    ]
  },
  "mapping": {
    "world_size": 4,
    "tp_size": 4,
    "pp_size": 1
  },
  "head_size": 128,
  "intermediate_size": 14336,
  "position_embedding_type": "rope_gpt_neox",
  "share_embedding_table": false,
  "residual_mlp": false,
  "bias": false,
  "rotary_pct": 1.0,
  "rank": 0,
  "decoder": "llama",
  "rmsnorm": true,
  "lm_head_bias": false,
  "moe_num_experts": 8,
  "moe_top_k": 2
}
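For what it's worth, the assertion in cutlass_preprocessors.cpp is an alignment check: the packed byte extents of a quantized weight must both be multiples of 32. Below is a minimal back-of-the-envelope sketch of that arithmetic using the numbers from the error message and from this config; which tensor actually trips the check is my guess, not something the traceback states.

# Sketch of the alignment check that fails in cutlass_preprocessors.cpp.
# The attribution of the extents to a specific tensor is an assumption
# based on the config above.

def is_aligned(num_rows_bytes: int, num_col_bytes: int) -> bool:
    # The CUTLASS mixed-GEMM weight preprocessor requires both byte
    # extents to be multiples of 32.
    return num_rows_bytes % 32 == 0 and num_col_bytes % 32 == 0

# Values reported in the assertion: the row extent is fine, the column
# extent of 8 bytes is not.
print(is_aligned(1792, 8))    # False

# 1792 bytes of packed int4 is 3584 elements, which is plausibly
# intermediate_size / tp_size = 14336 / 4 from the config.
print(14336 // 4 // 2)        # 1792

# An 8-byte column extent points at a very narrow tensor; with
# moe_num_experts = 8, the MoE router/gate is a likely candidate, but
# that is speculation on my part.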
I believe int4 AWQ for Mixtral is not supported in the public TRT-LLM release yet. In the coming TRT-LLM release, we will support FP8 first and maybe int4 AWQ later.
@cjluo-omniml Ok, thanks for the info, but why support FP8 first, when it only runs on Ada and Hopper archs (much harder to get anywhere and more expensive), instead of int4 AWQ, which runs on any Ampere card and is far more available, like the A10G in AWS or any A100?
For LLM serving, especially enterprise LLM serving, large-batch throughput is the focus of optimization. int4_awq is usually good at low batch sizes, but the performance gain diminishes compared with FP8 as the batch size increases.
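(A rough way to see this, under idealized assumptions: at small batch sizes decoding is memory-bandwidth-bound, so 4-bit weights cut the dominant cost, reading the weights, by roughly 4x; as the batch grows the GEMMs become compute-bound, and W4A16 still does its math in FP16 after dequantization while FP8 also halves the compute side. The sketch below uses made-up, A100/H100-ish placeholder numbers, purely for illustration.)

# Illustrative-only model of per-step decode time for the weight GEMMs:
# time ~= max(weight_bytes / mem_bw, flops / math_rate).
# All hardware numbers are rough placeholders, not measurements.

MEM_BW = 2.0e12          # ~2 TB/s HBM bandwidth (placeholder)
FP16_FLOPS = 300e12      # dense FP16 tensor-core rate (placeholder)
FP8_FLOPS = 600e12       # FP8 roughly doubles that on Hopper/Ada (placeholder)

def step_time(batch, n_params, weight_bits, math_rate):
    weight_bytes = n_params * weight_bits / 8
    flops = 2 * batch * n_params             # one decoded token per sequence
    return max(weight_bytes / MEM_BW, flops / math_rate)

n_params = 13e9   # rough active parameters per token for Mixtral-8x7B (top-2 experts)
for batch in (1, 8, 64, 256):
    t_w4a16 = step_time(batch, n_params, 4, FP16_FLOPS)  # int4 weights, FP16 math
    t_fp8 = step_time(batch, n_params, 8, FP8_FLOPS)     # FP8 weights and math
    print(f"batch={batch:4d}  w4a16={t_w4a16*1e3:6.2f} ms  fp8={t_fp8*1e3:6.2f} ms")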
@cjluo-omniml Yes, we are aware of this, but it's incredibly hard to get FP8-capable cards or nodes from most cloud providers. In our case, AWS only offers A10G nodes for inference that can actually be obtained; getting an H100 node for inference loads is almost impossible. I guarantee you there are other companies like us in this category. We are part of the NVIDIA Inception program, if that helps. Do you have any suggestions for using Tensor Parallelism and Flash Attention on A10Gs or A100s?
For A10G, we recommend you try int8 SmoothQuant or int4 AWQ. For Mixtral support, so far the quantization development is focused on FP8 and int4 AWQ.
As to "Do you have any suggestions for being able to use A10G with Tensor Parallelism and Flash Attention in A10G or A100s?": absolutely. I believe you can also try FP16 Mixtral with TRT-LLM if you have enough GPU memory on a single node.
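(For reference, a rough sketch of that FP16, tensor-parallel path using the TensorRT-LLM LLaMA/Mixtral example scripts; the model and output paths below are placeholders of mine, and the exact script location and flags may differ between TRT-LLM versions.)

# Sketch only: convert an HF Mixtral checkpoint to a TP=4 FP16 TRT-LLM
# checkpoint, then build the engine. Paths below are placeholders.
python examples/llama/convert_checkpoint.py \
    --model_dir /shared/Mixtral-8x7B-Instruct-v0.1 \
    --output_dir /shared/mixtral-8x7b-fp16-trt-tp4 \
    --dtype float16 \
    --tp_size 4

trtllm-build --checkpoint_dir /shared/mixtral-8x7b-fp16-trt-tp4 \
    --output_dir /shared/mixtral-8x7b-fp16-engine-trt-tp4 \
    --gemm_plugin float16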
@cjluo-omniml Thanks. We will try FP16 because we are super sensitive to accuracy, and I guess we'll wait for the int4_awq release and support so we can reduce the size of the node, etc. We would really appreciate it if it's at least released on main, or in tandem with FP8 in the cut releases. Thanks again!
Hi. I ran the command:
On a node with 8 A100 GPUs. I set TP to 4 because I want to build the engine for 4 GPUs further down the flow. The model was successfully quantized, but when it starts the conversion of the checkpoint to a TensorRT-LLM engine it throws this error:
This error is thrown no matter the machine: I quantized on 8 A100-80GB GPUs and it throws the error there, but it also throws it when trying to build the engine on an A10. Same error.