NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Fail to load `cuda_ext_fp8` and `cuda_ext` when installed with pip from whl #98

Closed YixuanSeanZhou closed 3 weeks ago

YixuanSeanZhou commented 3 weeks ago

Hi modelopt developers

I was installing modelopt from the released files on the PyPI registry (I specifically need this for my use case). In particular, I am using this version: nvidia_modelopt-0.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl. However, I noticed that when I ran torch FP8 quantization, it gave me the following error:

cuda_ext_fp8 could not be imported. E4M3 quantization requires CUDA and cuda_ext_fp8.

Could you please advise on how / where I can install the extensions and make them available to the library when loading?

Thank you in advance,

kevalmorabia97 commented 3 weeks ago

Do you have CUDA? If so, what version? And what is your torch version?

YixuanSeanZhou commented 3 weeks ago

Hi @kevalmorabia97 , thanks for chiming in.

I do have cuda=12.2, see nvidia-smi below. Also I have torch == 2.4.0+cu124

>>> import torch
>>> torch.__version__
'2.4.0+cu124'
>>> 
nvidia-smi
Wed Oct 30 10:54:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               Off | 00000000:65:00.0 Off |                  Off |
| 41%   47C    P2              32W / 140W |    241MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2734      G   /usr/lib/xorg/Xorg                           64MiB |
|    0   N/A  N/A      3611      G   /usr/bin/gnome-shell                          5MiB |
|    0   N/A  N/A    473267      C   ...ig_python/bin/python_relinked_glibc      160MiB |
+---------------------------------------------------------------------------------------+
kevalmorabia97 commented 3 weeks ago

Great. Can you run this command and share the entire output of it:

python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"
YixuanSeanZhou commented 3 weeks ago

Hi @kevalmorabia97, thank you so much for getting back so fast! I got the following attribute error running the cmd you suggested

python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/quantization/tensor_quant.py:92: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  scaled_e4m3_abstract = torch.library.impl_abstract("trt::quantize_fp8")(
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/quantization/extensions.py", line 63, in __getattr__
    raise AttributeError(f"module {__name__} has no attribute {name}")
AttributeError: module modelopt.torch.quantization.extensions has no attribute precompile
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp1o0psyhv'>
  _warnings.warn(warn_message, ResourceWarning)
kevalmorabia97 commented 3 weeks ago

My bad, I see that you are using the older 0.15.1 version instead of the latest 0.19.0. Can you run this command instead:

python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_ext); print(ext.cuda_ext_fp8)"
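For readers who want to run the same check from a script rather than a one-liner, here is a minimal sketch. The helper name `probe_extensions` is hypothetical and not part of Model Optimizer; it only assumes what this thread shows, namely that `cuda_ext` and `cuda_ext_fp8` are attributes of `modelopt.torch.quantization.extensions` and that they print as `None` when the extension could not be built and loaded.

```python
from types import SimpleNamespace


def probe_extensions(ext, names=("cuda_ext", "cuda_ext_fp8")):
    """Map each extension name to True if it loaded, False otherwise.

    `ext` is expected to be `modelopt.torch.quantization.extensions`; a missing
    attribute and a value of None both count as "not loaded".
    """
    return {name: getattr(ext, name, None) is not None for name in names}


# Stand-in object so the sketch runs without modelopt installed. In practice,
# pass the real module: `import modelopt.torch.quantization.extensions as ext`.
fake_ext = SimpleNamespace(cuda_ext=None, cuda_ext_fp8=None)
print(probe_extensions(fake_ext))  # {'cuda_ext': False, 'cuda_ext_fp8': False}
```

With the real module, a `False` entry for `cuda_ext_fp8` corresponds to the "cuda_ext_fp8 could not be imported" error reported at the top of this issue.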
YixuanSeanZhou commented 3 weeks ago

Seems like both were unavailable to load 😢

python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_ext); print(ext.cuda_ext_fp8)"
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/quantization/tensor_quant.py:92: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  scaled_e4m3_abstract = torch.library.impl_abstract("trt::quantize_fp8")(
Loading extension modelopt_cuda_ext...
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/utils/cpp_extension.py:58: UserWarning: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
Unable to load extension modelopt_cuda_ext and falling back to CPU version.
  warnings.warn(
None
Loading extension modelopt_cuda_ext_fp8...
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/utils/cpp_extension.py:58: UserWarning: CUDA extension for FP8 quantization could not be built and loaded, FP8 simulated quantization will not be available.
  warnings.warn(
None
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprq_1onoo'>
  _warnings.warn(warn_message, ResourceWarning)
kevalmorabia97 commented 3 weeks ago

It seems the CUDA_HOME environment variable is not set. Please set it to your CUDA install root. Can you look into this please? This seems unrelated to Model Optimizer.
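For anyone hitting the same warning, here is a minimal sketch of how the CUDA install root might be located before setting the variable. The helper name `guess_cuda_home` and the `/usr/local/cuda` fallback are illustrative assumptions, not part of Model Optimizer. The lookup order (the `CUDA_HOME` variable, then `CUDA_PATH`, then the directory above `nvcc` on `PATH`) follows the common convention and should be checked against your own setup.

```python
import os
import shutil


def guess_cuda_home(env=None, which=shutil.which, default="/usr/local/cuda"):
    """Return a likely CUDA install root, or None if nothing plausible is found.

    Hypothetical helper for illustration only; not a Model Optimizer API.
    """
    env = os.environ if env is None else env
    # 1. An explicitly configured root wins.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]
    # 2. Otherwise derive the root from nvcc: <root>/bin/nvcc -> <root>.
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))
    # 3. Fall back to the conventional Linux location if it exists.
    return default if os.path.isdir(default) else None


if __name__ == "__main__":
    root = guess_cuda_home()
    if root is None:
        print("No CUDA toolkit found; install one or set CUDA_HOME manually.")
    else:
        print(f"export CUDA_HOME={root}")
```

Note that the "CUDA Version" reported by nvidia-smi is a property of the driver, so a toolkit containing `nvcc` still has to be present at the path you export.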

YixuanSeanZhou commented 3 weeks ago

Ah, @kevalmorabia97, thank you so much for this pointer! After pointing CUDA_HOME to the correct location (I also needed to fix a ninja issue on my Ubuntu machine), I seem to be able to run the correct kernels and also get FP8 working!

I will close this now. Thank you again for your help! ❤

Nikki-Gu commented 1 day ago

I got the same error and the same output when running python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()". But I have already installed ninja and set CUDA_PATH as follows: [image]

Can someone help me with this? Thank you in advance.

kevalmorabia97 commented 1 day ago

@Nikki-Gu it seems like you are on Windows. Can you try WSL instead? PyTorch features are experimental on Windows.

That being said, can you also share the full output error? It might be a common error you can search online to get some solutions.

Nikki-Gu commented 1 day ago

Yes, it's on Windows. The full output error is as follows.

D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\utils\cpp_extension.py:58: UserWarning: CUDA extension for FP8 quantization could not be built and loaded, FP8 simulated quantization will not be available.
  warnings.warn(
Traceback (most recent call last):
  File "D:\LLM\TensorRT-LLM-v14\examples\quantization\quantize.py", line 139, in <module>
    quantize_and_export(
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\tensorrt_llm\quantization\quantize_by_modelopt.py", line 539, in quantize_and_export
    model = quantize_model(model, quant_cfg, calib_dataloader, batch_size,
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\tensorrt_llm\quantization\quantize_by_modelopt.py", line 349, in quantize_model
    atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\model_quant.py", line 131, in quantize
    return calibrate(model, config["algorithm"], forward_loop=forward_loop)
  File "modelopt\\torch\\quantization\\model_calib.py", line 102, in modelopt.torch.quantization.model_calib.calibrate
  File "modelopt\\torch\\quantization\\model_calib.py", line 529, in modelopt.torch.quantization.model_calib.awq
  File "modelopt\\torch\\quantization\\model_calib.py", line 531, in modelopt.torch.quantization.model_calib.awq
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "modelopt\\torch\\quantization\\model_calib.py", line 703, in modelopt.torch.quantization.model_calib.awq_lite
  File "modelopt\\torch\\quantization\\model_calib.py", line 82, in modelopt.torch.quantization.model_calib.calibrate.forward_loop
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\tensorrt_llm\quantization\quantize_by_modelopt.py", line 318, in calibrate_loop
    model(data)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 1167, in forward
    outputs = self.model(
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 976, in forward
    layer_outputs = decoder_layer(
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 702, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 580, in forward
    query_states = self.q_proj(hidden_states)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "modelopt\\torch\\quantization\\model_calib.py", line 655, in modelopt.torch.quantization.model_calib.awq_lite.forward
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\quant_module.py", line 84, in forward
    return super().forward(input, *args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\quant_module.py", line 42, in forward
    return self.output_quantizer(output)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\tensor_quantizer.py", line 667, in forward
    outputs = self._quant_forward(inputs)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\tensor_quantizer.py", line 440, in _quant_forward
    outputs = scaled_e4m3(
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\autograd\function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\tensor_quant.py", line 381, in forward
    outputs = quantize_op(
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\tensor_quant.py", line 123, in _quantize_impl
    return scaled_e4m3_impl(inputs=inputs, amax=amax)
  File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\tensor_quant.py", line 50, in scaled_e4m3_impl
    cuda_ext_fp8 is not None
AssertionError: cuda_ext_fp8 could not be imported. E4M3 quantization requires CUDA and cuda_ext_fp8.
kevalmorabia97 commented 1 day ago

Can you run just python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" and share the output? It should print a warning, something like <torch error stack trace> Unable to load extension modelopt_cuda_ext_fp8 and falling back to CPU version.

Nikki-Gu commented 1 day ago

The output is:

Loading extension modelopt_cuda_ext...
D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\utils\cpp_extension.py:58: UserWarning: Ninja is required to load C++ extensions
Unable to load extension modelopt_cuda_ext and falling back to CPU version.
  warnings.warn(
None
Loading extension modelopt_cuda_ext_fp8...
D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\utils\cpp_extension.py:58: UserWarning: CUDA extension for FP8 quantization could not be built and loaded, FP8 simulated quantization will not be available.
  warnings.warn(
None
kevalmorabia97 commented 1 day ago

Which version of nvidia-modelopt are you using? Ideally there should be more details

Nikki-Gu commented 1 day ago

The version is 0.17.0.

kevalmorabia97 commented 1 day ago

Can you also try upgrading to 0.19.0 version? What does python -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())" show?

Nikki-Gu commented 1 day ago

It shows 12.4 and True. Sure, I will have a try.

Nikki-Gu commented 1 day ago

Version 0.19.0 didn't work for this case

kevalmorabia97 commented 1 day ago

But does it show more error details?

Nikki-Gu commented 1 day ago

No, still the same errors as version 0.17.0

kevalmorabia97 commented 1 day ago

I'd suggest you give it a try on Linux / WSL then
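The "Ninja is required to load C++ extensions" warning in the output above suggests that the build step could not invoke ninja from the Python process, even if ninja is installed somewhere on the machine. A minimal sketch for checking this from the same environment follows. The helper name `ninja_version` is hypothetical, and the sketch assumes the extension loader looks for a `ninja` executable on `PATH`.

```python
import subprocess


def ninja_version(cmd="ninja"):
    """Return the version string printed by `<cmd> --version`, or None on failure."""
    try:
        result = subprocess.run(
            [cmd, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # Not on PATH, not executable, or exited with a non-zero status.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = ninja_version()
    if version:
        print(f"ninja {version}")
    else:
        print("ninja is not runnable from this environment")
```

If this reports that ninja is not runnable, installing it into the active environment (for example with `pip install ninja`) and re-running the extension check is a reasonable next step.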