NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
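As a rough illustration of the workflow this description refers to, recent TensorRT-LLM releases document a high-level LLM API along the following lines (a minimal sketch; class and argument names are taken from that documentation and may differ from the version used in this issue):

from tensorrt_llm import LLM, SamplingParams

# Build (or load) a TensorRT engine for the model, then run inference on it.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)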

Error when quantize llama-2-7b-chat-hf with format `int4_awq` #91

Closed · gesanqiu closed this issue 11 months ago

gesanqiu commented 1 year ago

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=========================================================================================|
+---------------------------------------------------------------------------------------+


- weight quantization command line:

python quantize.py --model_dir /workdir/hf_models/llama-2-7b-chat-hf/ --dtype float16 --qformat int4_awq --export_path ./llama-7b-4bit-gs128-awq.pt --calib_size 32

and the output log is shown below:

`max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(

Replaced 675 modules to quantized modules
Caching activation statistics for awq_lite...

Traceback (most recent call last):
  /workdir/TensorRT-LLM/examples/llama/quantize.py:135 in <module>
    main()
  /workdir/TensorRT-LLM/examples/llama/quantize.py:128 in main
    model = quantize_and_export(model, qformat=args.qformat, calib_dataloader=calib_dataloader, export_path=args.export_path)
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:79 in quantize_and_export
    model = _quantize_model(model, qformat=qformat, calib_dataloader=calib_dataloader)
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:55 in _quantize_model
    atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
  /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/model_quant.py:114 in quantize
    calibrate(model, config["algorithm"], forward_loop=forward_loop)
  in ammo.torch.quantization.model_calib.calibrate:60
  in ammo.torch.quantization.model_calib.awq:182
  in ammo.torch.quantization.model_calib.awq_lite:294
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:52 in calibrate_loop
    model(data)
  [... repeated torch.nn.Module _call_impl and accelerate hooks.py wrapper frames omitted ...]
  /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:806 in forward
    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, ...)
  /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:693 in forward
    layer_outputs = decoder_layer(hidden_states, attention_mask=attention_mask, position_ids=position_ids, ...)
  /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:408 in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(hidden_states=hidden_states, ...)
  /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:305 in forward
    query_states = self.q_proj(hidden_states)
  in ammo.torch.quantization.model_calib.awq_lite.forward:256
  /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/quant_module.py:58 in forward
    output = self._original_forward(self.input_quantizer(input), *args, **kwargs)
  /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward
    output = old_forward(*args, **kwargs)
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114 in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

jdemouth-nvidia commented 1 year ago

I’m forwarding to our colleagues from the AMMO team. Thanks.

BasicCoder commented 1 year ago

please reference #46.

jdemouth-nvidia commented 1 year ago

> please reference #46.

What’s the connection? Sorry, but that’s not clear to me. Can you explain a bit more, please :)

BasicCoder commented 1 year ago

> please reference #46.
>
> What’s the connection? Sorry, but that’s not clear to me. Can you explain a bit more, please :)

Sorry, I made a mistake about the driver version. Maybe the 535.86 driver (the driver version in the base image) and the 535.104 driver are compatible. This error may be caused by a compatibility problem or by running out of GPU memory (OOM). You can try reducing calib_size.
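For example, reusing the command from the original report, a smaller calibration set lowers memory pressure during calibration (the value 16 below is only an illustrative smaller choice):

python quantize.py --model_dir /workdir/hf_models/llama-2-7b-chat-hf/ --dtype float16 --qformat int4_awq --export_path ./llama-7b-4bit-gs128-awq.pt --calib_size 16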

RalphMao commented 1 year ago

Hi @BasicCoder , just to double check, does https://github.com/NVIDIA/TensorRT-LLM/pull/46 fix this error?

I took a glance at the error message and it seems unrelated to quantization. To verify that, you can call forward_loop before quantization to see if the same error is raised, like this:

# In /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:55

logger.debug("Starting quantization...")
forward_loop()
atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)

Otherwise, if it is indeed an AMMO library compatibility issue, we will release an AMMO wheel that does compilation on the fly to avoid potential library conflicts.
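Since the forward loop that _quantize_model passes to atq.quantize is its local calibrate_loop function (visible in the ammo.py frames of the traceback above), the suggested check amounts to running that loop once on the still-unquantized model. A sketch of the edit, assuming the file looks as it does in the traceback (exact line numbers may differ):

# Sketch of the check inside _quantize_model in
# /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py

def calibrate_loop():
    # Plain forward passes over the calibration batches, no quantization involved.
    for idx, data in enumerate(calib_dataloader):
        logger.debug(f"Calibrating batch {idx}")
        model(data)

logger.debug("Starting quantization...")
calibrate_loop()   # run once BEFORE atq.quantize: if the cuBLAS error already
                   # appears here, plain FP16 inference is failing and AMMO's
                   # int4_awq calibration is not the culprit
atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)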

BasicCoder commented 1 year ago

> Hi @BasicCoder , just to double check, does #46 fix this error?
>
> I took a glance at the error message and it seems unrelated to quantization. To verify that, you can call forward_loop before quantization to see if the same error is raised, like this:
>
> # In /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:55
>
> logger.debug("Starting quantization...")
> forward_loop()
> atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
>
> Otherwise, if it is indeed an AMMO library compatibility issue, we will release an AMMO wheel that does compilation on the fly to avoid potential library conflicts.

> RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

First of all, in my experience this error is usually caused by an incompatibility between the installed CUDA version and the CUDA version PyTorch was built with; you can also search for this type of error in the PyTorch community or on Stack Overflow. I am very sure that this error has nothing to do with AMMO: as the call stack shows, it is raised right at the entry to PyTorch's F.linear. Second, following AMMO's instructions I did not encounter any AMMO errors. Of course, I enabled compatibility as described in #46.
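To check the version-mismatch theory outside of TensorRT-LLM and AMMO, a small standalone script (an illustrative sketch, not part of this thread) can print the relevant versions and push a bare FP16 matmul through the same F.linear entry point seen in the traceback; if it fails with the same cuBLAS error, the PyTorch/CUDA/driver combination is the problem:

import torch
import torch.nn.functional as F

# Versions PyTorch was built against vs. what the machine reports.
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0))

# Bare FP16 GEMM through the same entry point seen in the traceback (F.linear).
x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
y = F.linear(x, w)
torch.cuda.synchronize()
print("FP16 linear OK:", y.shape)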

RalphMao commented 1 year ago

@BasicCoder We have uploaded a new tarball that contains from-source wheel files, which do the compilation on the fly.

byshiue commented 11 months ago

Closing this bug because the issue is inactive. Feel free to ask here if you still have a question or an issue, and we will reopen it.