Closed gesanqiu closed 11 months ago
I’m forwarding to our colleagues from the AMMO team. Thanks.
please reference #46.
What’s the connection? Sorry, but that’s not clear to me. Can you explain a bit more, please :)
Sorry, I made a mistake about the driver version. The 535.86 driver (the version in the base image) and the 535.104 driver may well be compatible. This error may be caused by a compatibility problem or by OOM; you can try reducing the calib_size.
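If you want to check the OOM hypothesis concretely, a minimal sketch like the one below could be run inside the container. It only assumes that `model` and a single calibration batch `data` are already prepared the same way as in examples/llama/quantize.py.

```python
# Sketch: report free GPU memory around a single calibration forward pass.
# `model` and `data` are assumed to be set up as in examples/llama/quantize.py.
import torch

def report(tag: str) -> None:
    free, total = torch.cuda.mem_get_info()
    print(f"{tag}: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")

report("before forward")
with torch.no_grad():
    model(data)
torch.cuda.synchronize()
report("after forward")
```

If free memory is already close to zero before the forward pass, lowering --calib_size is the first thing to try.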
Hi @BasicCoder , just to double check, does https://github.com/NVIDIA/TensorRT-LLM/pull/46 fix this error?
I took a glance at the error message and it seems unrelated to quantization. To verify that, you can call forward_loop before quantization to see if the same error is raised, like this:
# In /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:55
logger.debug("Starting quantization...")
forward_loop()
atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
Otherwise, if it is indeed an AMMO library compatibility issue, we will release an AMMO wheel that does the compilation on the fly to avoid potential library conflicts.
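For convenience, a rough standalone version of the same check (a sketch that assumes `model`, `tokenizer`, and `get_calib_dataloader` are set up exactly as in examples/llama/quantize.py) would look like this; it runs the calibration forward pass with no AMMO quantization involved:

```python
# Sketch: run the calibration forward loop on the plain, un-quantized model.
# If the cuBLAS error is raised here too, it is unrelated to AMMO quantization.
import torch

calib_dataloader = get_calib_dataloader(tokenizer=tokenizer, calib_size=32)

with torch.no_grad():
    for idx, data in enumerate(calib_dataloader):
        print(f"calibrating batch {idx}")
        model(data)

print("forward loop finished without quantization")
```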
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling ‘cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)’
First of all, in my experience this error is often caused by an incompatibility between the CUDA version and the CUDA version PyTorch was built with. You can also search for this type of error in the PyTorch community or on Stack Overflow. I am quite sure this error has nothing to do with AMMO; as the call stack shows, it is raised at the entry point of PyTorch's F.linear.
Second, following AMMO's instructions I did not encounter any AMMO errors. Of course, I enabled compatibility as described in #46.
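One way to test the CUDA/PyTorch compatibility theory in isolation (a sketch, independent of TensorRT-LLM and AMMO) is to print the CUDA version PyTorch was built with and run a bare fp16 linear, which goes through the same cublasGemmEx path that fails in the traceback:

```python
# Sketch: check the PyTorch CUDA build and reproduce the failing op in isolation.
import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))

a = torch.randn(32, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
out = torch.nn.functional.linear(a, w)  # fp16 GEMM via cuBLAS, same path as F.linear above
torch.cuda.synchronize()
print("fp16 F.linear OK:", tuple(out.shape))
```

If this minimal call fails with the same CUBLAS_STATUS_NOT_SUPPORTED error, the problem is in the PyTorch/CUDA/driver stack rather than in the quantization code.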
@BasicCoder We have uploaded a new tarball that contains from-source wheel files, which do the compilation on the fly.
Closing this bug because the issue is inactive. Feel free to ask here if you still have a question or issue, and we will reopen it.
Docker image: tensorrt_llm/release:latest, built according to https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/docs/source/installation.md#build-tensorrt-llm-in-one-step
nvidia-smi output:
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
python quantize.py --model_dir /workdir/hf_models/llama-2-7b-chat-hf/ --dtype float16 --qformat int4_awq --export_path ./llama-7b-4bit-gs128-awq.pt --calib_size 32
`max_length` is ignored when `padding=True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
Replaced 675 modules to quantized modules
Caching activation statistics for awq_lite...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workdir/TensorRT-LLM/examples/llama/quantize.py:135 in <module>                                  │
│ │
│ 132 │
│ 133 │
│ 134 if __name__ == "__main__": │
│ ❱ 135 │ main() │
│ 136 │
│ │
│ /workdir/TensorRT-LLM/examples/llama/quantize.py:128 in main │
│ │
│ 125 │ │
│ 126 │ calib_dataloader = get_calib_dataloader(tokenizer=tokenizer, │
│ 127 │ │ │ │ │ │ │ │ │ │ │ calib_size=args.calib_size) │
│ ❱ 128 │ model = quantize_and_export(model, │
│ 129 │ │ │ │ │ │ │ │ qformat=args.qformat, │
│ 130 │ │ │ │ │ │ │ │ calib_dataloader=calib_dataloader, │
│ 131 │ │ │ │ │ │ │ │ export_path=args.export_path) │
│ │
│ /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:79 in │
│ quantize_and_export │
│ │
│ 76 │ │ raise NotImplementedError( │
│ 77 │ │ │ f"Deploying quantized model {model_cls_name} is not supported") │
│ 78 │ │
│ ❱ 79 │ model = _quantize_model(model, │
│ 80 │ │ │ │ │ │ │ qformat=qformat, │
│ 81 │ │ │ │ │ │ │ calib_dataloader=calib_dataloader) │
│ 82 │
│ │
│ /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:55 in │
│ _quantize_model │
│ │
│ 52 │ │ │ model(data) │
│ 53 │ │
│ 54 │ logger.debug("Starting quantization...") │
│ ❱ 55 │ atq.quantize(model, quant_cfg, forward_loop=calibrate_loop) │
│ 56 │ logger.debug("Quantization done") │
│ 57 │ return model │
│ 58 │
│ │
│ /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/model_quant.py:114 in quantize │
│ │
│ 111 │ """ │
│ 112 │ replace_quant_module(model) │
│ 113 │ set_quantizer_by_cfg(model, config["quant_cfg"]) │
│ ❱ 114 │ calibrate(model, config["algorithm"], forward_loop=forward_loop) │
│ 115 │ return model │
│ 116 │
│ 117 │
│ │
│ in ammo.torch.quantization.model_calib.calibrate:60 │
│ │
│ in ammo.torch.quantization.model_calib.awq:182 │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115 in decorate_context │
│ │
│ 112 │ @functools.wraps(func) │
│ 113 │ def decorate_context(*args, **kwargs): │
│ 114 │ │ with ctx_factory(): │
│ ❱ 115 │ │ │ return func(*args, **kwargs) │
│ 116 │ │
│ 117 │ return decorate_context │
│ 118 │
│ │
│ in ammo.torch.quantization.model_calib.awq_lite:294 │
│ │
│ /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py:52 in │
│ calibrate_loop │
│ │
│ 49 │ │ """Adjusts weights and scaling factors based on selected algorithms.""" │
│ 50 │ │ for idx, data in enumerate(calib_dataloader): │
│ 51 │ │ │ logger.debug(f"Calibrating batch {idx}") │
│ ❱ 52 │ │ │ model(data) │
│ 53 │ │
│ 54 │ logger.debug("Starting quantization...") │
│ 55 │ atq.quantize(model, quant_cfg, forward_loop=calibrate_loop) │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1505 in _wrapped_call_impl │
│ │
│ 1502 │ │ if self._compiled_call_impl is not None: │
│ 1503 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] │
│ 1504 │ │ else: │
│ ❱ 1505 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1506 │ │
│ 1507 │ def _call_impl(self, *args, **kwargs): │
│ 1508 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.fo │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1514 in _call_impl │
│ │
│ 1511 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1512 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1513 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1514 │ │ │ return forward_call(*args, **kwargs) │
│ 1515 │ │ # Do not call functions when jit is used │
│ 1516 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1517 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:806 in │
│ forward │
│ │
│ 803 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 804 │ │ │
│ 805 │ │ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) │
│ ❱ 806 │ │ outputs = self.model( │
│ 807 │ │ │ input_ids=input_ids, │
│ 808 │ │ │ attention_mask=attention_mask, │
│ 809 │ │ │ position_ids=position_ids, │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1505 in _wrapped_call_impl │
│ │
│ 1502 │ │ if self._compiled_call_impl is not None: │
│ 1503 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] │
│ 1504 │ │ else: │
│ ❱ 1505 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1506 │ │
│ 1507 │ def _call_impl(self, *args, **kwargs): │
│ 1508 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.fo │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1514 in _call_impl │
│ │
│ 1511 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1512 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1513 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1514 │ │ │ return forward_call(*args, **kwargs) │
│ 1515 │ │ # Do not call functions when jit is used │
│ 1516 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1517 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:693 in │
│ forward │
│ │
│ 690 │ │ │ │ │ None, │
│ 691 │ │ │ │ ) │
│ 692 │ │ │ else: │
│ ❱ 693 │ │ │ │ layer_outputs = decoder_layer( │
│ 694 │ │ │ │ │ hidden_states, │
│ 695 │ │ │ │ │ attention_mask=attention_mask, │
│ 696 │ │ │ │ │ position_ids=position_ids, │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1505 in _wrapped_call_impl │
│ │
│ 1502 │ │ if self._compiled_call_impl is not None: │
│ 1503 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] │
│ 1504 │ │ else: │
│ ❱ 1505 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1506 │ │
│ 1507 │ def _call_impl(self, *args, **kwargs): │
│ 1508 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.fo │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1514 in _call_impl │
│ │
│ 1511 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1512 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1513 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1514 │ │ │ return forward_call(*args, **kwargs) │
│ 1515 │ │ # Do not call functions when jit is used │
│ 1516 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1517 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:408 in │
│ forward │
│ │
│ 405 │ │ hidden_states = self.input_layernorm(hidden_states) │
│ 406 │ │ │
│ 407 │ │ # Self Attention │
│ ❱ 408 │ │ hidden_states, self_attn_weights, present_key_value = self.self_attn( │
│ 409 │ │ │ hidden_states=hidden_states, │
│ 410 │ │ │ attention_mask=attention_mask, │
│ 411 │ │ │ position_ids=position_ids, │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1505 in _wrapped_call_impl │
│ │
│ 1502 │ │ if self._compiled_call_impl is not None: │
│ 1503 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] │
│ 1504 │ │ else: │
│ ❱ 1505 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1506 │ │
│ 1507 │ def _call_impl(self, *args, **kwargs): │
│ 1508 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.fo │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1514 in _call_impl │
│ │
│ 1511 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1512 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1513 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1514 │ │ │ return forward_call(*args, **kwargs) │
│ 1515 │ │ # Do not call functions when jit is used │
│ 1516 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1517 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:305 in │
│ forward │
│ │
│ 302 │ │ │ value_states = torch.cat(value_states, dim=-1) │
│ 303 │ │ │
│ 304 │ │ else: │
│ ❱ 305 │ │ │ query_states = self.q_proj(hidden_states) │
│ 306 │ │ │ key_states = self.k_proj(hidden_states) │
│ 307 │ │ │ value_states = self.v_proj(hidden_states) │
│ 308 │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1505 in _wrapped_call_impl │
│ │
│ 1502 │ │ if self._compiled_call_impl is not None: │
│ 1503 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] │
│ 1504 │ │ else: │
│ ❱ 1505 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1506 │ │
│ 1507 │ def _call_impl(self, *args, **kwargs): │
│ 1508 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.fo │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1514 in _call_impl │
│ │
│ 1511 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1512 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1513 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1514 │ │ │ return forward_call(*args, **kwargs) │
│ 1515 │ │ # Do not call functions when jit is used │
│ 1516 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1517 │ │ backward_pre_hooks = [] │
│ │
│ in ammo.torch.quantization.model_calib.awq_lite.forward:256 │
│ │
│ /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/quant_module.py:58 in │
│ forward │
│ │
│ 55 │ │ │
│ 56 │ │ def forward(self, input, *args, **kwargs): │
│ 57 │ │ │ self.__dict__["weight"] = self.weight_quantizer(self.weight) │
│ ❱ 58 │ │ │ output = self._original_forward(self.input_quantizer(input), *args, **kwargs │
│ 59 │ │ │ del self.__dict__["weight"] │
│ 60 │ │ │ if isinstance(output, tuple): │
│ 61 │ │ │ │ return (self.output_quantizer(output[0]), *output[1:]) │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114 in forward │
│ │
│ 111 │ │ │ init.uniform_(self.bias, -bound, bound) │
│ 112 │ │
│ 113 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 114 │ │ return F.linear(input, self.weight, self.bias) │
│ 115 │ │
│ 116 │ def extra_repr(self) -> str: │
│ 117 │ │ return 'in_features={}, out_features={}, bias={}'.format( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling
cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)