jFkd1 opened this issue 1 year ago
Yes. TensorRT-LLM only supports models quantized by AMMO.
@byshiue For GPTQ it seems we are not supposed to use AMMO, though: https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/llama#gptq
Is this expected to change in the future?
@jFkd1 I can get AutoGPTQ to work for me using world_size=1, but I see the same issue when using world_size=4, tp_size=4. Are you using the same? If so, the fact that the dimension is off by a factor of 4 (the world size) seems suspicious. I am hitting a similar error: Updated: (1280, 1920), original: (5120, 1920), which is also off by a factor of 4.
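For context, a minimal sketch (my own illustration, not TensorRT-LLM code) of why an off-by-world_size dimension points at tensor parallelism: each rank keeps only a 1/tp_size slice along the split axis, so 5120 / 4 = 1280.

```python
# Illustration only (not TensorRT-LLM code): under tensor parallelism each rank
# keeps a 1/tp_size slice of the split dimension, so a shape mismatch by exactly
# tp_size usually means a tensor was split (or not split) for the rank.
import torch

tp_size = 4
full = torch.zeros(5120, 1920)                    # full checkpoint tensor
per_rank = torch.chunk(full, tp_size, dim=0)[0]   # slice a single rank should hold

print(full.shape)      # torch.Size([5120, 1920])
print(per_rank.shape)  # torch.Size([1280, 1920]) -> 5120 / 4 = 1280, as in the error above
```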
@eycheung @jFkd1 Please try adding -1 as the third parameter of this function call: code. If that works, I will push an MR soon.
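In case it helps to picture what the -1 does (a generic illustration, not the actual signature of the linked call): a -1 lets the size of that axis be inferred from the element count instead of being hard-coded.

```python
# Generic illustration of a -1 dimension; not the real call in
# tensorrt_llm/models/llama/weight.py that the link above points to.
import torch

out_features = 1920
num_groups = 15                                  # arbitrary for the example
scales = torch.zeros(num_groups * out_features)  # flat per-group scales

explicit = scales.reshape(num_groups, out_features)  # needs the group count up front
inferred = scales.reshape(-1, out_features)          # group count inferred from numel
assert explicit.shape == inferred.shape == (num_groups, out_features)
```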
The example recommends using gptq-for-llama, but the author of gptq-for-llama recommends using auto_gptq.
Actually quantizing with AMMO would be a huge time sink for existing users of QLoRA (you'd have to quantize to int4 with AMMO and relearn QLoRA, and the examples aren't good enough).
Are there any plans to support auto_gptq?
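In case it's useful to others landing here, quantizing with auto_gptq itself is short; below is a rough sketch following auto_gptq's documented API (the model path, output path, and the single calibration sentence are placeholders, not from this thread).

```python
# Rough sketch of producing a GPTQ checkpoint with auto_gptq; the paths and
# the calibration sample are placeholders, not from this issue.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_dir = "meta-llama/Llama-2-13b-chat-hf"   # placeholder model path
quantized_dir = "llama-2-13b-chat-4bit-gs128"       # placeholder output path

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)
# Calibration samples: a real run should use a few hundred representative texts.
examples = [tokenizer("TensorRT-LLM builds engines from quantized checkpoints.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # int4 weights
    group_size=128,  # one scale row per 128 input channels
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)
```

The resulting safetensors file should be the kind of groupwise GPTQ checkpoint the llama example's GPTQ path expects.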
It raised an error like this...
[01/29/2024-05:05:25] [TRT-LLM] [I] Loading weights from groupwise GPTQ LLaMA safetensors...
[01/29/2024-05:05:26] [TRT-LLM] [I] Process weights in layer: 0
Traceback (most recent call last):
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 1092, in <module>
    build(0, args)
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 1037, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 912, in build_rank_engine
    tensorrt_llm_llama = get_model_object(args,
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 786, in get_model_object
    load_from_gptq_llama(tensorrt_llm_llama=tensorrt_llm_llama,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1305, in load_from_gptq_llama
    process_and_assign_weight(layer.attention.qkv, qkv_weight_list)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1260, in process_and_assign_weight
    mOp.weights_scaling_factor.value = scales_fp16.cpu().numpy()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 115, in value
    assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (1, 15360), original: (40, 15360)
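For what it's worth, the numbers in that assertion line up with Llama-2-13B dimensions if one assumes hidden size 5120, a fused QKV projection, and GPTQ group size 128; a quick check:

```python
# Quick arithmetic check (assuming Llama-2-13B: hidden_size=5120, no GQA,
# GPTQ group_size=128); not taken from the TensorRT-LLM source.
hidden_size = 5120
group_size = 128

qkv_out = 3 * hidden_size               # 15360 columns for the fused Q/K/V
num_groups = hidden_size // group_size  # 40 rows of per-group scales

print((num_groups, qkv_out))            # (40, 15360) -- the expected ("original") shape
# An "Updated" shape of (1, 15360) would match scales with a single group,
# e.g. a checkpoint quantized with group_size=-1 (no grouping).
```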
Qwen is supported with auto-gptq... how do I do that for LLaMA v2?
Do you still have any further issue or question? If not, we'll close it soon.
I received the following error when trying to compile an engine for an AutoGPTQ-quantized Llama-2-13b-chat.
Does this mean AutoGPTQ-quantized models are not currently supported, and we would have to use AMMO?