Open PeterWang1986 opened 5 months ago
My solution in llama/convert_checkpoint.py:
... # about line 666
def get_tllm_linear_weight(weight,
                           prefix,
                           bias=None,
                           use_weight_only=False,
                           plugin_weight_only_quant_type=torch.int8,
                           dtype='float32',
                           use_gemm_woq_plugin=True,
                           postfix='weight'):
    results = {}
    print(f"{weight.shape=}")
    if use_weight_only:
        if len(weight.shape) == 3:
            v = weight.permute(0, 2, 1).contiguous()
        else:
            v = weight.t().contiguous()
        processed_torch_weights, torch_weight_scales = \
            torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix(
                v.cpu(), plugin_weight_only_quant_type)
        if not use_gemm_woq_plugin:
            results[prefix + postfix] = v.to(dtype)
        else:
            results[prefix + postfix] = processed_torch_weights
        if postfix != '':
            results[prefix + 'per_channel_scale'] = torch_weight_scales  # changed
        else:
            results[prefix.replace("experts_weight", 'experts_scale')] = torch_weight_scales  # changed
    else:
        results[prefix + postfix] = weight.contiguous()
    if bias is not None:
        results[prefix + 'bias'] = bias
    return results
But converting tllm_prex + 'mlp.router.' then fails with:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 2048 and num_col_bytes = 4. (/TensorRT-LLM/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:255)
The router weight has shape [4096, 8], which fails the following check in the kernel preprocessor:
// We assume the dims are a multiple of vector width. Our kernels only handle dims which are multiples
// of 64 for weight-only quantization. As a result, this seemed like a reasonable tradeoff because it
// allows GCC to emit vector instructions.
TLLM_CHECK_WITH_INFO(!(col_bytes_trans % VECTOR_WIDTH) && !(col_bytes % VECTOR_WIDTH),
    fmtstr("Number of bytes for rows and cols must be a multiple of %d. However, num_rows_bytes = %ld and "
           "num_col_bytes = %ld.",
        VECTOR_WIDTH, col_bytes_trans, col_bytes));
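The byte counts in the error message can be reproduced by hand, which shows why the router weight trips this assertion. The sketch below is an illustration only, not the library's code. It assumes each dimension's byte count is `dim * bits / 8` (this matches the reported 2048 and 4 for a [4096, 8] weight at 4 bits per element) and takes the vector width of 32 from the error message. The helper names `bytes_per_dim` and `passes_alignment_check` are hypothetical.

```python
VECTOR_WIDTH = 32  # taken from the error message ("must be a multiple of 32")


def bytes_per_dim(shape, bits_per_element):
    """Byte count of each dimension, assuming bytes = dim * bits / 8."""
    return tuple(dim * bits_per_element // 8 for dim in shape)


def passes_alignment_check(shape, bits_per_element, vector_width=VECTOR_WIDTH):
    """Mimic the assertion: every byte count must be a multiple of vector_width."""
    return all(b % vector_width == 0
               for b in bytes_per_dim(shape, bits_per_element))


router_shape = (4096, 8)  # router weight shape reported above

print(bytes_per_dim(router_shape, 4))             # (2048, 4), as in the error
print(passes_alignment_check(router_shape, 4))    # False: 4 is not a multiple of 32
print(passes_alignment_check(router_shape, 8))    # False: 8 is not a multiple of 32
print(passes_alignment_check((4096, 4096), 4))    # True for a larger square matrix
```

Under these assumptions the second dimension of the router (one column per expert) is far too small for the check at either int4 or int8, so the router cannot go through this preprocessing path regardless of precision.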
Since the quantization operator cannot be used on this shape, and self.router is very small relative to the rest of the weights, I think it does not need to be quantized and can stay in fp16.
So I changed the convert_checkpoint.py script as follows:
... # about line 1422
moe_experts_gate_weights = get_weight(
    model_params, prefix + 'block_sparse_moe.gate', dtype)
v = split(moe_experts_gate_weights,
          mapping.tp_size,
          mapping.tp_rank,
          dim=-1)
weights.update(
    get_tllm_linear_weight(v.to(torch.float32),
                           tllm_prex + 'mlp.router.', None,
                           False,  # changed
                           plugin_weight_only_quant_type, dtype,
                           use_gemm_woq_plugin))
... # about line 1757
if args.use_weight_only:
    if args.weight_only_precision == 'int8':
        config['quantization']['quant_algo'] = 'W8A16'
    elif args.weight_only_precision == 'int4':
        config['quantization']['quant_algo'] = 'W4A16'
    # No quantization router when woq
    if args.moe_num_experts > 0:
        config['quantization']['exclude_modules'] = ['lm_head', 'router']  # changed
...
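The configuration change above can be exercised in isolation. The following is a minimal sketch, not the script's actual code: `build_quantization_config` is a hypothetical helper, and it assumes the router exclusion applies only when weight-only quantization is enabled and the model has experts.

```python
def build_quantization_config(use_weight_only, weight_only_precision,
                              moe_num_experts):
    """Hypothetical helper mirroring the snippet above.

    Not part of convert_checkpoint.py; it only reproduces the branching logic.
    """
    quantization = {'quant_algo': None}
    if use_weight_only:
        if weight_only_precision == 'int8':
            quantization['quant_algo'] = 'W8A16'
        elif weight_only_precision == 'int4':
            quantization['quant_algo'] = 'W4A16'
        # Assumption: for MoE models, leave the small router (and lm_head)
        # unquantized so they never reach the weight-only preprocessing.
        if moe_num_experts > 0:
            quantization['exclude_modules'] = ['lm_head', 'router']
    return quantization


print(build_quantization_config(True, 'int4', 8))
print(build_quantization_config(True, 'int8', 0))
```

The first call models a Mixtral-style conversion with 8 experts; the second models a dense model, where no exclusion list is added.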
I re-converted the checkpoint and rebuilt the engine, but it couldn't be deployed:
python: malloc.c:2617: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
[dev-aigc-20:51558] *** Process received signal ***
[dev-aigc-20:51558] Signal: Aborted (6)
[dev-aigc-20:51558] Signal code: (-6)
[dev-aigc-20:51558] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7febb6f1b520]
[dev-aigc-20:51558] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7febb6f6f9fc]
[dev-aigc-20:51558] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7febb6f1b476]
[dev-aigc-20:51558] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7febb6f017f3]
[dev-aigc-20:51558] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0eca)[0x7febb6f79eca]
[dev-aigc-20:51558] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa3937)[0x7febb6f7c937]
[dev-aigc-20:51558] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa48dd)[0x7febb6f7d8dd]
[dev-aigc-20:51558] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(malloc+0x99)[0x7febb6f7e139]
[dev-aigc-20:51558] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x1c)[0x7feb8ce8298c]
[dev-aigc-20:51558] [ 9] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15FtDynamicDecodeIfEC1Emmmii+0x26f)[0x7feb1b22e68f]
[dev-aigc-20:51558] [10] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15DynamicDecodeOp14createInstanceEv+0xe7)[0x7feb1b21fe67]
[dev-aigc-20:51558] [11] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15DynamicDecodeOpC2ElllllN3c1010ScalarTypeE+0x8c)[0x7feb1b2202ac]
[dev-aigc-20:51558] [12] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZNSt17_Function_handlerIFvRSt6vectorIN3c106IValueESaIS2_EEEZN5torch6class_IN9torch_ext15DynamicDecodeOpEE12defineMethodIZNSB_3defIJlllllNS1_10ScalarTypeEEEERSB_NS7_6detail5typesIvJDpT_EEESsSt16initializer_listINS7_3argEEEUlNS1_14tagged_capsuleISA_EElllllSE_E_EEPNS7_3jit8FunctionESsT_SsSN_EUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0xf2)[0x7feb1b23e542]
[dev-aigc-20:51558] [13] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x947e8e)[0x7feb8a847e8e]
[dev-aigc-20:51558] [14] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x94542e)[0x7feb8a84542e]
[dev-aigc-20:51558] [15] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x9474c9)[0x7feb8a8474c9]
[dev-aigc-20:51558] [16] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x3f7304)[0x7feb8a2f7304]
[dev-aigc-20:51558] [17] /usr/bin/python(+0x15fe0e)[0x55f005ab2e0e]
[dev-aigc-20:51558] [18] /usr/bin/python(_PyObject_MakeTpCall+0x25b)[0x55f005aa95eb]
[dev-aigc-20:51558] [19] /usr/bin/python(+0x16e910)[0x55f005ac1910]
[dev-aigc-20:51558] [20] /usr/bin/python(+0x28420b)[0x55f005bd720b]
[dev-aigc-20:51558] [21] /usr/bin/python(_PyObject_MakeTpCall+0x25b)[0x55f005aa95eb]
[dev-aigc-20:51558] [22] /usr/bin/python(_PyEval_EvalFrameDefault+0x6aa1)[0x55f005aa21f1]
[dev-aigc-20:51558] [23] /usr/bin/python(_PyFunction_Vectorcall+0x7c)[0x55f005ab370c]
[dev-aigc-20:51558] [24] /usr/bin/python(_PyObject_FastCallDictTstate+0x16d)[0x55f005aa882d]
[dev-aigc-20:51558] [25] /usr/bin/python(+0x16a744)[0x55f005abd744]
[dev-aigc-20:51558] [26] /usr/bin/python(_PyObject_MakeTpCall+0x1fc)[0x55f005aa958c]
[dev-aigc-20:51558] [27] /usr/bin/python(_PyEval_EvalFrameDefault+0x71b8)[0x55f005aa2908]
[dev-aigc-20:51558] [28] /usr/bin/python(+0x16e4e1)[0x55f005ac14e1]
[dev-aigc-20:51558] [29] /usr/bin/python(_PyEval_EvalFrameDefault+0x1981)[0x55f005a9d0d1]
[dev-aigc-20:51558] *** End of error message ***
I have exactly the same problem when I try to use the convert script to convert Mixtral 8x7B to int8:
python3 convert_checkpoint.py --model_dir ./Mixtral8x7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8
But this error appeared:
Traceback (most recent call last):
File "/home/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1547, in
Do you have any update on the resolution to this issue please?
Many thanks!
I managed to quantize Mixtral 8x7B to 4 bpw.
I first tried running this command:
model="models--mistralai--Mixtral-8x7B-Instruct-v0.1"
model_dir="/models/$model"
model_chkpt_dir="/models/$model--trt-chkpt"
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
--model_dir "$model_dir" \
--output_dir "$model_chkpt_dir" \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4 \
--int8_kv_cache
and got the same error as @xesdiny:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 2048 and num_col_bytes = 4. (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:258)
There have been some changes in the library code since @xesdiny's comment. The second change proposed by @xesdiny (excluding the router module) seems to be already implemented in models/llama/convert.py when a mixture-of-experts model is detected.
I tried applying the first change proposed by @xesdiny (passing a False value to get_tllm_linear_weight). For clarity, this is the change in unified diff (-u) format:
--- convert.py 2024-03-25 19:38:29.042100000 +0100
+++ /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py 2024-03-25 19:46:03.221558000 +0100
@@ -791,7 +789,7 @@
             weights.update(
                 get_tllm_linear_weight(
                     moe_experts_gate_weights.to(torch.float32),
-                    tllm_prex + 'mlp.router.', None, use_weight_only,
+                    tllm_prex + 'mlp.router.', None, False,
                     plugin_weight_only_quant_type, dtype, use_gemm_woq_plugin))
         else:
             mlp_gate_weight = get_weight(model_params, prefix + 'mlp.up_proj',
With that change I ran the conversion command again and it succeeded. Then I built the engine without any error:
model_engine_dir="/models/$model--trt-engine"
trtllm-build --checkpoint_dir "$model_chkpt_dir" \
--gemm_plugin float16 \
--output_dir "$model_engine_dir"
Finally I tried the summarize example to test the engine, which also ran without error:
python3 TensorRT-LLM/examples/summarize.py --test_trt_llm \
--hf_model_dir "$model_dir" \
--data_type fp16 \
--engine_dir "$model_engine_dir"
Which setup did you run for quantization? I thought about quantizing with "--weight_only_precision int4" as well, but I haven't been able to find out how to use a multi-GPU setup when quantizing with TensorRT, and I think a single GPU would need over 100GB of VRAM (for Mixtral).
Yes, this requires at least 100GB of VRAM. I executed the code on a system equipped with three Nvidia A100 GPUs, each with 40GB of VRAM, so 120GB in total. When running the convert_checkpoint.py script, the model was automatically distributed across the three GPUs.
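As a rough sanity check on the 100GB figure, the memory needed just to hold the weights can be estimated from the parameter count. This is a back-of-the-envelope sketch: the total of about 46.7 billion parameters for Mixtral 8x7B is an assumption not stated in this thread, and the estimate ignores activations and any temporary buffers used during conversion.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: approximate total parameter count of Mixtral 8x7B.
MIXTRAL_8X7B_PARAMS = 46.7e9

fp16_gb = weight_memory_gb(MIXTRAL_8X7B_PARAMS, 2)    # float16: 2 bytes each
int4_gb = weight_memory_gb(MIXTRAL_8X7B_PARAMS, 0.5)  # int4: half a byte each

print(f"fp16 weights: ~{fp16_gb:.0f} GB")
print(f"int4 weights: ~{int4_gb:.0f} GB")
```

Under that assumption the unquantized fp16 weights alone come to a little over 90 GB, which is more than two 40GB GPUs can hold and is consistent with the three-GPU setup described above.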
System Info
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Call the convert script:
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /xxxx/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO/ \
    --output_dir /xxxxx/checkpoint \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 8
Expected behavior
The conversion succeeds.
actual behavior
You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [29:54<00:00, 94.44s/it]
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1971, in
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1956, in main
    covert_and_save(rank, convert_args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1893, in covert_and_save
    weights = convert_hf_llama(
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1393, in convert_hf_llama
    get_tllm_linear_weight(moe_experts_w2_weights,
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 676, in get_tllm_linear_weight
    v = weight.t().contiguous()
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 3D
additional notes
NO