NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Error when quantizing Phi-3 to fp8, AssertionError: <class 'transformers.pytorch_utils.Conv1D'> already registered! #24

Open Ross-Fan opened 2 months ago

Ross-Fan commented 2 months ago

Environment (inside Docker):
- Image: nvidia/cuda:12.1.0-devel-ubuntu22.04
- GPU: A100 40GB
- TensorRT-LLM version: 0.10.0
- flash-attn: 2.5.9.post1

I quantized the Phi-3 model (phi-3-medium-128k-instruct/) with the following command:

scripts/huggingface_example.sh --type phi --model $HF_PATH --quant fp8 --tp 1

The error occurs when the script runs "python3 hf_ptq.py":

/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/plugins/__init__.py:37: UserWarning: Failed to import diffusers plugin due to: RuntimeError('Failed to import transformers.models.falcon.modeling_falcon because of the following error (look up to see its traceback):\n/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa')
  warnings.warn(f"Failed to import diffusers plugin due to: {repr(e)}")
/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/plugins/__init__.py:44: UserWarning: Failed to import huggingface plugin due to: AssertionError("<class 'transformers.pytorch_utils.Conv1D'> already registered!")
  warnings.warn(f"Failed to import huggingface plugin due to: {repr(e)}")
Traceback (most recent call last):
  File "/home/tensorrt-model-optimizer/TensorRT-Model-Optimizer/llm_ptq/hf_ptq.py", line 32, in <module>
    import modelopt.torch.quantization as mtq
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/__init__.py", line 13, in <module>
    from . import opt, quantization, sparsity, utils  # noqa: E402
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/__init__.py", line 13, in <module>
    from . import mode, plugins
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/mode.py", line 25, in <module>
    from .conversion import (
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/conversion.py", line 23, in <module>
    from .plugins.custom import register_custom_model_plugins_on_the_fly
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/plugins/custom.py", line 25, in <module>
    from .huggingface import register_falcon_linears_on_the_fly
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/plugins/huggingface.py", line 27, in <module>
    class _QuantConv1D(_QuantLinear):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/opt/dynamic.py", line 1001, in decorator
    assert nncls not in self._registry, f"{nncls} already registered!"
AssertionError: <class 'transformers.pytorch_utils.Conv1D'> already registered!
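For context, the assertion itself comes from the registration decorator in modelopt/torch/opt/dynamic.py (the last frame above). Below is a minimal sketch of that pattern, only to illustrate the failure mode: apart from the assert line, which is copied from the traceback, the class and function names are made up for this example and are not the actual modelopt code.

```python
# Illustrative sketch only: a registry whose decorator refuses to register the
# same class twice. The assert mirrors modelopt/torch/opt/dynamic.py line 1001;
# everything else here is a stand-in, not the real implementation.
class _Registry:
    def __init__(self):
        self._registry = {}

    def register(self, nncls):
        def decorator(dyncls):
            # Registering the same class a second time (e.g. if the plugin
            # module is imported twice) hits this assert, which is the
            # AssertionError reported above.
            assert nncls not in self._registry, f"{nncls} already registered!"
            self._registry[nncls] = dyncls
            return dyncls
        return decorator


registry = _Registry()

@registry.register(dict)  # "dict" stands in for transformers.pytorch_utils.Conv1D
class _QuantA:
    pass

# A second registration of the same class reproduces the failure mode:
# @registry.register(dict)  # -> AssertionError: <class 'dict'> already registered!
# class _QuantB:
#     pass
```

Separately, the first UserWarning (undefined symbol _ZN3c104cuda14ExchangeDeviceEa in flash_attn_2_cuda) usually indicates a flash-attn build compiled against a different torch version than the one installed. A quick check that only prints the installed versions (assuming both packages expose __version__, which they normally do):

```python
# Print the installed torch and flash-attn versions to check whether the
# flash-attn wheel matches the torch build (a mismatch commonly causes the
# "undefined symbol" warning seen above).
import torch

print("torch:", torch.__version__)
try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as err:
    print("flash_attn import failed:", err)
```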

Any solution to this error?

cjluo-omniml commented 2 months ago

The quantization error should be gone after the upcoming release, which will be out soon. With the next release you should be able to quantize the model and generate the TRT-LLM checkpoint. You may want to follow up with TRT-LLM about when Phi-3 medium will be supported.

Ross-Fan commented 2 months ago

> The quantization error should be gone after the upcoming release, which will be out soon. With the next release you should be able to quantize the model and generate the TRT-LLM checkpoint. You may want to follow up with TRT-LLM about when Phi-3 medium will be supported.

Thanks for your answer. As mentioned above, I checked the models supported by TRT-LLM, and Phi-3 medium is not on the list. When will TRT-LLM support phi-3-medium-128k-instruct?

cjluo-omniml commented 2 months ago

It might be supported soon in the main branch and in the following TRT-LLM official release.