NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mixtral convert error: t() expects a tensor with <= 2 dimensions, but self is 3D #1041

Open PeterWang1986 opened 5 months ago

PeterWang1986 commented 5 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

call convert script:

python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /xxxx/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO/ \
    --output_dir /xxxxx/checkpoint \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 8

Expected behavior

convert successful

actual behavior

You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|██████████| 19/19 [29:54<00:00, 94.44s/it]
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1971, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1956, in main
    covert_and_save(rank, convert_args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1893, in covert_and_save
    weights = convert_hf_llama(
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 1393, in convert_hf_llama
    get_tllm_linear_weight(moe_experts_w2_weights,
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 676, in get_tllm_linear_weight
    v = weight.t().contiguous()
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 3D

additional notes

NO
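
For reference, the failing call (weight.t() on a 3D stacked expert weight) can be reproduced in plain PyTorch. A minimal sketch with made-up shapes, not code from the converter:

# Minimal sketch of the underlying PyTorch behavior; the shape is an
# illustrative stand-in for a stacked MoE expert weight, not read from
# the actual checkpoint.
import torch

w2 = torch.randn(8, 16, 32)  # [num_experts, out_features, in_features]

try:
    w2.t()  # torch.Tensor.t() only supports tensors with <= 2 dimensions
except RuntimeError as e:
    print(e)  # "t() expects a tensor with <= 2 dimensions, but self is 3D"

# For stacked (batched) expert weights the per-expert transpose is a permute:
v = w2.permute(0, 2, 1).contiguous()
print(v.shape)  # torch.Size([8, 32, 16])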

xesdiny commented 5 months ago

My solution in llama/convert_checkpoint.py

... # about line 666
def get_tllm_linear_weight(weight,
                           prefix,
                           bias=None,
                           use_weight_only=False,
                           plugin_weight_only_quant_type=torch.int8,
                           dtype='float32',
                           use_gemm_woq_plugin=True,
                           postfix='weight'):
    results = {}
    print(f"{weight.shape=}")
    if use_weight_only:
        if len(weight.shape)==3:
            v = weight.permute(0, 2, 1).contiguous()
        else:
            v = weight.t().contiguous()
        processed_torch_weights, torch_weight_scales = \
            torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix(
                v.cpu(), plugin_weight_only_quant_type)
        if not use_gemm_woq_plugin:
            results[prefix + postfix] = v.to(dtype)
        else:
            results[prefix + postfix] = processed_torch_weights
        if postfix != '':
            results[prefix + 'per_channel_scale'] = torch_weight_scales
        else:
            results[prefix.replace("experts_weight", 'experts_scale')] = torch_weight_scales
    else:
        results[prefix + postfix] = weight.contiguous()

    if bias is not None:
        results[prefix + 'bias'] = bias

    return results

But converting tllm_prex + 'mlp.router.' then fails with:

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 2048 and num_col_bytes = 4. (/TensorRT-LLM/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:255)

The router weight has shape [4096, 8], which triggers the error in this kernel check:

// We assume the dims are a multiple of vector width. Our kernels only handle dims which are multiples
    // of 64 for weight-only quantization. As a result, this seemed like a reasonable tradeoff because it
    // allows GCC to emit vector instructions.
    TLLM_CHECK_WITH_INFO(!(col_bytes_trans % VECTOR_WIDTH) && !(col_bytes % VECTOR_WIDTH),
        fmtstr("Number of bytes for rows and cols must be a multiple of %d. However, num_rows_bytes = %ld and "
               "num_col_bytes = %ld.",
            VECTOR_WIDTH, col_bytes_trans, col_bytes));
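
To see why the router trips this check, here is a rough Python analogue of the assertion (an illustration, not the actual kernel code). It assumes int4 weight-only packing, i.e. 0.5 bytes per element, which is what makes the numbers match the error message above:

# Rough Python analogue of the cutlass_preprocessors.cpp check quoted above.
VECTOR_WIDTH = 32  # bytes, from the quoted kernel code

def passes_check(rows, cols, bytes_per_elt=0.5):
    col_bytes = int(cols * bytes_per_elt)        # bytes per row
    col_bytes_trans = int(rows * bytes_per_elt)  # bytes per row after transpose
    ok = (col_bytes % VECTOR_WIDTH == 0) and (col_bytes_trans % VECTOR_WIDTH == 0)
    return col_bytes_trans, col_bytes, ok

print(passes_check(4096, 8))      # (2048, 4, False)   -> the reported failure for the router
print(passes_check(4096, 14336))  # (2048, 7168, True) -> a regular expert projection passes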

Since this operator cannot be quantized, and self.router is very small (it only produces the routing weights), I think it does not need to be quantized and can be kept in fp16.

So I changed the convert_checkpoint.py script:

... # about line 1422
            moe_experts_gate_weights = get_weight(
                model_params, prefix + 'block_sparse_moe.gate', dtype)
            v = split(moe_experts_gate_weights,
                      mapping.tp_size,
                      mapping.tp_rank,
                      dim=-1)

            weights.update(
                get_tllm_linear_weight(v.to(torch.float32),
                                       tllm_prex + 'mlp.router.', None,
                                       False,
                                       plugin_weight_only_quant_type, dtype,
                                       use_gemm_woq_plugin))
... # about line 1757
    if args.use_weight_only:
        if args.weight_only_precision == 'int8':
            config['quantization']['quant_algo'] = 'W8A16'
        elif args.weight_only_precision == 'int4':
            config['quantization']['quant_algo'] = 'W4A16'
        # do not quantize the router when using weight-only quantization
        if args.moe_num_experts > 0:
            config['quantization']['exclude_modules'] = ['lm_head', 'router']
...
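
With both changes, the quantization section of the generated checkpoint config should end up roughly like this (a sketch containing only the fields touched above; everything else in the config is omitted):

quantization_config = {
    'quant_algo': 'W8A16',                     # 'W4A16' for --weight_only_precision int4
    'exclude_modules': ['lm_head', 'router'],  # keep the MoE router un-quantized (fp16)
}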

I re-converted and re-built, but it couldn't be deployed:

python: malloc.c:2617: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
[dev-aigc-20:51558] *** Process received signal ***
[dev-aigc-20:51558] Signal: Aborted (6)
[dev-aigc-20:51558] Signal code:  (-6)
[dev-aigc-20:51558] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7febb6f1b520]
[dev-aigc-20:51558] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7febb6f6f9fc]
[dev-aigc-20:51558] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7febb6f1b476]
[dev-aigc-20:51558] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7febb6f017f3]
[dev-aigc-20:51558] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0eca)[0x7febb6f79eca]
[dev-aigc-20:51558] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa3937)[0x7febb6f7c937]
[dev-aigc-20:51558] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa48dd)[0x7febb6f7d8dd]
[dev-aigc-20:51558] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(malloc+0x99)[0x7febb6f7e139]
[dev-aigc-20:51558] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x1c)[0x7feb8ce8298c]
[dev-aigc-20:51558] [ 9] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15FtDynamicDecodeIfEC1Emmmii+0x26f)[0x7feb1b22e68f]
[dev-aigc-20:51558] [10] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15DynamicDecodeOp14createInstanceEv+0xe7)[0x7feb1b21fe67]
[dev-aigc-20:51558] [11] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15DynamicDecodeOpC2ElllllN3c1010ScalarTypeE+0x8c)[0x7feb1b2202ac]
[dev-aigc-20:51558] [12] /mnt/nfs/dev-aigc-2/data2/nidongwang/tmp/code/TensorRT-LLM/tensorrt_llm/libs/libth_common.so(_ZNSt17_Function_handlerIFvRSt6vectorIN3c106IValueESaIS2_EEEZN5torch6class_IN9torch_ext15DynamicDecodeOpEE12defineMethodIZNSB_3defIJlllllNS1_10ScalarTypeEEEERSB_NS7_6detail5typesIvJDpT_EEESsSt16initializer_listINS7_3argEEEUlNS1_14tagged_capsuleISA_EElllllSE_E_EEPNS7_3jit8FunctionESsT_SsSN_EUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0xf2)[0x7feb1b23e542]
[dev-aigc-20:51558] [13] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x947e8e)[0x7feb8a847e8e]
[dev-aigc-20:51558] [14] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x94542e)[0x7feb8a84542e]
[dev-aigc-20:51558] [15] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x9474c9)[0x7feb8a8474c9]
[dev-aigc-20:51558] [16] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x3f7304)[0x7feb8a2f7304]
[dev-aigc-20:51558] [17] /usr/bin/python(+0x15fe0e)[0x55f005ab2e0e]
[dev-aigc-20:51558] [18] /usr/bin/python(_PyObject_MakeTpCall+0x25b)[0x55f005aa95eb]
[dev-aigc-20:51558] [19] /usr/bin/python(+0x16e910)[0x55f005ac1910]
[dev-aigc-20:51558] [20] /usr/bin/python(+0x28420b)[0x55f005bd720b]
[dev-aigc-20:51558] [21] /usr/bin/python(_PyObject_MakeTpCall+0x25b)[0x55f005aa95eb]
[dev-aigc-20:51558] [22] /usr/bin/python(_PyEval_EvalFrameDefault+0x6aa1)[0x55f005aa21f1]
[dev-aigc-20:51558] [23] /usr/bin/python(_PyFunction_Vectorcall+0x7c)[0x55f005ab370c]
[dev-aigc-20:51558] [24] /usr/bin/python(_PyObject_FastCallDictTstate+0x16d)[0x55f005aa882d]
[dev-aigc-20:51558] [25] /usr/bin/python(+0x16a744)[0x55f005abd744]
[dev-aigc-20:51558] [26] /usr/bin/python(_PyObject_MakeTpCall+0x1fc)[0x55f005aa958c]
[dev-aigc-20:51558] [27] /usr/bin/python(_PyEval_EvalFrameDefault+0x71b8)[0x55f005aa2908]
[dev-aigc-20:51558] [28] /usr/bin/python(+0x16e4e1)[0x55f005ac14e1]
[dev-aigc-20:51558] [29] /usr/bin/python(_PyEval_EvalFrameDefault+0x1981)[0x55f005a9d0d1]
[dev-aigc-20:51558] *** End of error message ***
mfournioux commented 4 months ago

I have exactly the same problem when I try to use the convert script to convert a Mixtral 8x7B to int8:

python3 convert_checkpoint.py --model_dir ./Mixtral8x7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

But this error appeared:

Traceback (most recent call last):
  File "/home/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1547, in <module>
    main()
  File "/home/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1523, in main
    covert_and_save(rank, convert_args)
  File "/home/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1489, in covert_and_save
    weights = convert_hf_llama(
  File "/home/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 983, in convert_hf_llama
    get_tllm_linear_weight(moe_experts_w2_weights,
  File "/home/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 660, in get_tllm_linear_weight
    v = weight.t().contiguous()
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 3D

Do you have any update on the resolution to this issue please?

Many thanks!

tombolano commented 3 months ago

I managed to quantize Mixtral 8x7B to 4 bpw.

I first tried running this command:

model="models--mistralai--Mixtral-8x7B-Instruct-v0.1"
model_dir="/models/$model"
model_chkpt_dir="/models/$model--trt-chkpt"

python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir "$model_dir" \
    --output_dir "$model_chkpt_dir" \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --int8_kv_cache

and got the same error as @xesdiny:

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 2048 and num_col_bytes = 4. (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:258)

There have been some changes in the library code since @xesdiny's comment; the second change proposed by @xesdiny (excluding the router module from quantization) seems to already be implemented in models/llama/convert.py when a mixture-of-experts model is detected.

I tried applying the first change proposed by @xesdiny (passing a False value to get_tllm_linear_weight). For clarity, this is the change in unified diff (-u) format:

--- convert.py  2024-03-25 19:38:29.042100000 +0100
+++ /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py    2024-03-25 19:46:03.221558000 +0100
@@ -791,7 +789,7 @@
             weights.update(
                 get_tllm_linear_weight(
                     moe_experts_gate_weights.to(torch.float32),
-                    tllm_prex + 'mlp.router.', None, use_weight_only,
+                    tllm_prex + 'mlp.router.', None, False,
                     plugin_weight_only_quant_type, dtype, use_gemm_woq_plugin))
         else:
             mlp_gate_weight = get_weight(model_params, prefix + 'mlp.up_proj',

With that change I ran the conversion command again and it succeeded. Then I built the engine without any error:

model_engine_dir="/models/$model--trt-engine"
trtllm-build --checkpoint_dir "$model_chkpt_dir" \
    --gemm_plugin float16 \
    --output_dir "$model_engine_dir"

Finally I tried the summarize example to test the engine, which also ran without error:

python3 TensorRT-LLM/examples/summarize.py --test_trt_llm \
    --hf_model_dir "$model_dir" \
    --data_type fp16 \
    --engine_dir "$model_engine_dir"
larin92 commented 3 months ago

(quotes @tombolano's comment above in full)

Which setup did you use for the quantization? I thought about quantizing with --weight_only_precision int4 as well, but I haven't been able to find out how to use a multi-GPU setup when quantizing with TensorRT, and a single GPU would need more than 100 GB of VRAM for Mixtral, I think.

tombolano commented 3 months ago

Yes, this requires at least 100GB of VRAM. I executed the code on a system equipped with three Nvidia A100 GPUs, each with 40GB of VRAM, so 120GB in total. When running the convert_checkpoint.py script the model was automatically distributed across the three GPUs.
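
For what it's worth, the multi-GPU sharding during conversion happens on the Hugging Face side of the load. A minimal sketch of that mechanism, assuming the checkpoint is loaded through transformers with device_map="auto" (whether this matches the exact call convert_checkpoint.py makes in your version is an assumption):

import torch
from transformers import AutoModelForCausalLM

# Shard the HF checkpoint across all visible GPUs (requires the accelerate package).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
print(model.hf_device_map)  # shows which module ended up on which GPU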