NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llama-2-7b - Tensor Dimension Mismatch Error for AWQ Engine Build for 4 GPUs #522


taozhang9527 commented 9 months ago

Model under test: Llama-2-7b-chat-hf

Following the instructions here, I was able to quantize the model and build an engine for the single-GPU scenario, but a tensor dimension mismatch error occurred when building for 4 GPUs with tensor parallelism (TP).

Command:

python examples/llama/build.py \
    --model_dir ./Llama-2-7b-chat-hf \
    --quant_ckpt_path ./Llama-2-7b-chat-hf_awq/llama-7b-4bit-gs128-awq.pt \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --world_size 4 \
    --tp_size 4 \
    --output_dir ./examples/llama/out/7b/awq_4gpu/

Error:

Traceback (most recent call last):

  /code/tensorrt_llm/examples/llama/build.py:718 in <module>
      715     else:
      716         args.parallel_build = False
      717         logger.info('Serially build TensorRT engines.')
    ❱ 718         build(0, args)
      719
      720     tok = time.time()
      721     t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))

  /code/tensorrt_llm/examples/llama/build.py:689 in build
      686                 opt_level=args.builder_opt)
      687         engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size,
      688                                       args.pp_size, cur_rank)
    ❱ 689         engine = build_rank_engine(builder, builder_config, engine_name,
      690                                    cur_rank, args)
      691         assert engine is not None, f'Failed to build engine for rank {cur_rank}'

  /code/tensorrt_llm/examples/llama/build.py:543 in build_rank_engine
      540                                         quant_scales=quant_scales)
      541     if args.per_group:
      542         load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else
    ❱ 543         load_func(tensorrt_llm_llama=tensorrt_llm_llama,
      544                   quant_ckpt_path=args.quant_ckpt_path,
      545                   mapping=mapping,
      546                   dtype=args.dtype)

  /code/tensorrt_llm/examples/llama/weight.py:1237 in load_from_awq_llama
      1234        # MLP down_proj (mlp.proj) Linear
      1235        mPrefix = prefix + "mlp.down_proj"
      1236        mOp = tensorrt_llm_llama.layers[layer_idx].mlp.proj
    ❱ 1237        process_and_assign_weight(awq_llama, mPrefix, mOp, 0)
      1238
      1239        # MLP gate_proj (mlp.fc) Linear
      1240        mPrefix = prefix + "mlp.gate_proj"

  /code/tensorrt_llm/examples/llama/weight.py:1108 in process_and_assign_weight
      1105            pre_quant_scale = pre_quant_scale.split(k // mapping.tp_size,
      1106                                                    dim=1)[mapping.tp_rank]
      1107        scale = amax / 8.0
    ❱ 1108        mOp.qweight.value = AWQ_quantize_pack_preprocess(weight, scale)
      1109        mOp.scale.value = scale.to(torch_dtype).cpu().numpy()
      1110        mOp.pre_quant_scale.value = pre_quant_scale.to(
      1111            torch_dtype).cpu().numpy()

  /code/tensorrt_llm/examples/llama/weight.py:1087 in AWQ_quantize_pack_preprocess
      1085    def AWQ_quantize_pack_preprocess(weight, scale):
      1086        scale = scale.repeat_interleave(group_size, dim=0)
    ❱ 1087        weight = weight / scale
      1088        qweight_int8 = torch.clamp(torch.round(weight.cuda()).char(), -8, 7)
      1089        int4_weight = packer(qweight_int8.cpu())
      1090        int4_weight = preprocessor(int4_weight, torch.quint4x2)

RuntimeError: The size of tensor a (2752) must match the size of tensor b (2688) at non-singleton dimension 0
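
For what it's worth, the numbers line up with the mlp.down_proj split (my reading of the traceback, not a confirmed diagnosis): Llama-2-7B's MLP intermediate size is 11008, so each of the 4 ranks gets 11008 / 4 = 2752 weight rows, while the per-group amax tensor has 11008 / 128 = 86 rows, and 86 // 4 truncates to 21, i.e. 21 * 128 = 2688 rows after repeat_interleave. A minimal sketch that reproduces the mismatch under those assumed shapes:

```python
import torch

# Assumed Llama-2-7B shapes: k = MLP intermediate size, AWQ group size 128, TP=4.
k, group_size, tp_size = 11008, 128, 4

weight_rank0 = torch.ones(k // tp_size, 4096)   # 2752 rows in this rank's weight slice
amax = torch.ones(k // group_size, 4096)        # 86 per-group scale rows in total

# Splitting 86 rows with chunk size 86 // 4 = 21 silently truncates the last chunk:
amax_rank0 = amax.split(amax.shape[0] // tp_size, dim=0)[0]      # 21 rows
scale = (amax_rank0 / 8.0).repeat_interleave(group_size, dim=0)  # 21 * 128 = 2688 rows

print(weight_rank0.shape[0], scale.shape[0])  # 2752 vs 2688 -> the reported sizes
weight_rank0 / scale  # raises RuntimeError: sizes 2752 and 2688 differ at dim 0
```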

jdemouth-nvidia commented 9 months ago

Can you test if the issue persists in the main branch, please? I see fixes for Llama AWQ TP > 1 in our internal repo that were pushed to the main branch but are not in release/0.5.0.

taozhang9527 commented 9 months ago

Do I need to rebuild the docker image for the main branch? I noticed that some files were updated under the docker folder. From the notes, it seems there are updates on both the main branch and the 0.5 release that are not synced with each other: this branch is [20 commits ahead](https://github.com/NVIDIA/TensorRT-LLM/compare/release/0.5.0...main) of and [23 commits behind](https://github.com/NVIDIA/TensorRT-LLM/compare/main...release/0.5.0) release/0.5.0. When will the next official version be available?

taozhang9527 commented 9 months ago

I reused the docker image built for the 0.5.0 branch:

REPOSITORY            TAG          IMAGE ID       CREATED        SIZE
tensorrt_llm/release  latest-root  5e43c4749c11   41 hours ago   26.8GB

and generated a new container with the following command:

docker run --name tensorrt-llm-main_test --privileged -idt --net=host --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
    -v /mnt/tao-new/TensorRT-LLM-main:/code/tensorrt_llm 5e43c4749c11 bash

But I got a module-not-found error for tensorrt_llm when I tried to build the engine:

python examples/llama/build.py \
    --model_dir ./Llama-2-7b-chat-hf \
    --quant_ckpt_path ./Llama-2-7b-chat-hf_awq/llama-7b-4bit-gs128-awq.pt \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --world_size 4 \
    --tp_size 4 \
    --output_dir ./examples/llama/out/7b/awq_4gpu/

ModuleNotFoundError: No module named 'tensorrt_llm.runtime.lora_manager'

Shouldn't it be installed inside the docker image already?
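
If it helps anyone hitting the same thing, here is a minimal check (assuming the container still carries the 0.5.0 wheel in site-packages while /code/tensorrt_llm is main-branch source) to see which installation Python actually resolves:

```python
# Sketch: print where tensorrt_llm is imported from, and whether the
# module that the main-branch build.py needs exists in that install.
import importlib.util

spec = importlib.util.find_spec("tensorrt_llm")
print("tensorrt_llm resolved from:", spec.origin)

lora = importlib.util.find_spec("tensorrt_llm.runtime.lora_manager")
print("runtime.lora_manager available:", lora is not None)
```

If the printed path points at the wheel baked into the old image rather than the mounted source tree, the main-branch wheel presumably needs to be rebuilt and reinstalled inside the container.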

taozhang9527 commented 8 months ago

I tried the latest main-branch code and rebuilt the docker image. TP size = 4 still gives me the following error:

[12/12/2023-00:21:08] [TRT-LLM] [E] Current weight shape is invalid for mapping.tp_size=4
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /code/tensorrt_llm/examples/llama/build.py:839 in <module>                                       │
│                                                                                                  │
│   836 │   else:                                                                                  │
│   837 │   │   args.parallel_build = False                                                        │
│   838 │   │   logger.info('Serially build TensorRT engines.')                                    │
│ ❱ 839 │   │   build(0, args)                                                                     │
│   840 │                                                                                          │
│   841 │   tok = time.time()                                                                      │
│   842 │   t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))                                  │
│                                                                                                  │
│ /code/tensorrt_llm/examples/llama/build.py:783 in build                                          │
│                                                                                                  │
│   780 │   │   )                                                                                  │
│   781 │   │   engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size,                │
│   782 │   │   │   │   │   │   │   │   │     args.pp_size, cur_rank)                              │
│ ❱ 783 │   │   engine = build_rank_engine(builder, builder_config, engine_name,                   │
│   784 │   │   │   │   │   │   │   │      cur_rank, args)                                         │
│   785 │   │   assert engine is not None, f'Failed to build engine for rank {cur_rank}'           │
│   786                                                                                            │
│ /code/tensorrt_llm/examples/llama/build.py:602 in build_rank_engine                              │
│                                                                                                  │
│   599 │   │   │   │   │   │   │   │   │   │   **quantize_kwargs)                                 │
│   600 │   if args.per_group:                                                                     │
│   601 │   │   load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else   │
│ ❱ 602 │   │   load_func(tensorrt_llm_llama=tensorrt_llm_llama,                                   │
│   603 │   │   │   │     quant_ckpt_path=args.quant_ckpt_path,                                    │
│   604 │   │   │   │     mapping=mapping,                                                         │
│   605 │   │   │   │     dtype=args.dtype,                                                        │
│                                                                                                  │
│ /code/tensorrt_llm/examples/llama/weight.py:1465 in load_from_awq_llama                          │
│                                                                                                  │
│   1462 │   │                                                                                     │
│   1463 │   │   # 4.4 mlp.proj                                                                    │
│   1464 │   │   v = [load(prefix + awq_key_list[7] + suf) for suf in awq_suffix_list]             │
│ ❱ 1465 │   │   process_and_assign_weight(layer.mlp.proj, v, 0)                                   │
│   1466 │   │                                                                                     │
│   1467 │   │   # 4.5 mlp.fc                                                                      │
│   1468 │   │   v = [load(prefix + awq_key_list[8] + suf) for suf in awq_suffix_list]             │
│                                                                                                  │
│ /code/tensorrt_llm/examples/llama/weight.py:1350 in process_and_assign_weight                    │
│                                                                                                  │
│   1347 │   │   [k, n] = weight.shape                                                             │
│   1348 │   │   weight = torch_split(weight, tp_dim)                                              │
│   1349 │   │   amax = v[1].reshape((n, k // group_size)).T.contiguous()                          │
│ ❱ 1350 │   │   amax = torch_split(amax, tp_dim)                                                  │
│   1351 │   │   pre_quant_scale = v[2].reshape((1, k))                                            │
│   1352 │   │   if tp_dim == 0:                                                                   │
│   1353 │   │   │   pre_quant_scale = torch_split(pre_quant_scale, 1)                             │
│                                                                                                  │
│ /code/tensorrt_llm/examples/llama/weight.py:1335 in torch_split                                  │
│                                                                                                  │
│   1332 │   │   │   tensorrt_llm.logger.error(                                                    │
│   1333 │   │   │   │   "Current weight shape is invalid for mapping.tp_size=" +                  │
│   1334 │   │   │   │   str(mapping.tp_size))                                                     │
│ ❱ 1335 │   │   │   assert False, "Invalid TP size"                                               │
│   1336 │   │   return v.split(v.shape[dim] // mapping.tp_size,                                   │
│   1337 │   │   │   │   │      dim=dim)[mapping.tp_rank]                                          │
│   1338                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: Invalid TP size

The single-GPU build is still successful, just like before.
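
The assertion looks like the same underlying issue as before, now caught explicitly (again my interpretation, assuming group size 128): mlp.proj splits its k = 11008 input dimension across ranks, the AWQ scales have 11008 / 128 = 86 group rows along that dimension, and 86 is not divisible by 4, so torch_split bails out. A small sketch of that divisibility check:

```python
# Assumed shapes: Llama-2-7B MLP intermediate size 11008, AWQ group size 128.
k, group_size = 11008, 128
n_groups = k // group_size  # 86 per-group scale rows along the split dimension

for tp_size in (1, 2, 4, 8):
    status = "OK" if n_groups % tp_size == 0 else "Invalid TP size"
    print(f"tp_size={tp_size}: {n_groups} groups -> {status}")
# 86 splits evenly for TP=1 and TP=2 but not for TP=4 or TP=8,
# which matches the single-GPU build succeeding while TP=4 fails.
```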

taozhang9527 commented 8 months ago

In the 0.5 release, the quantization step generated a .pt file. Why does this version generate an .npz file instead?
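
For reference, the two formats just need different loaders mechanically. A hypothetical sketch (the helper name load_awq_ckpt is illustrative, not a function from the repo):

```python
import numpy as np
import torch

def load_awq_ckpt(path):
    """Illustrative only: .pt checkpoints are torch-serialized dicts of
    tensors, while .npz checkpoints are numpy archives keyed by name."""
    if path.endswith(".npz"):
        return {name: arr for name, arr in np.load(path).items()}
    return torch.load(path, map_location="cpu")
```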

byshiue commented 3 months ago

Could you try the latest version and see if the issue persists?