Closed: YixuanSeanZhou closed this issue 3 weeks ago
Do you have CUDA? If so, what version? And what is your torch version?
Hi @kevalmorabia97, thanks for chiming in.
I do have CUDA 12.2; see the nvidia-smi output below. I also have torch == 2.4.0+cu124:
>>> import torch
>>> torch.__version__
'2.4.0+cu124'
>>>
nvidia-smi
Wed Oct 30 10:54:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:65:00.0 Off | Off |
| 41% 47C P2 32W / 140W | 241MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2734 G /usr/lib/xorg/Xorg 64MiB |
| 0 N/A N/A 3611 G /usr/bin/gnome-shell 5MiB |
| 0 N/A N/A 473267 C ...ig_python/bin/python_relinked_glibc 160MiB |
+---------------------------------------------------------------------------------------+
Great. Can you run this command and share its entire output:
python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"
Hi @kevalmorabia97, thank you so much for getting back so fast! I got the following AttributeError running the command you suggested:
python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/quantization/tensor_quant.py:92: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
scaled_e4m3_abstract = torch.library.impl_abstract("trt::quantize_fp8")(
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/quantization/extensions.py", line 63, in __getattr__
raise AttributeError(f"module {__name__} has no attribute {name}")
AttributeError: module modelopt.torch.quantization.extensions has no attribute precompile
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp1o0psyhv'>
_warnings.warn(warn_message, ResourceWarning)
My bad, I see that you are using the older 0.15.1 version instead of the latest 0.19.0. Can you run this command instead:
python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_ext); print(ext.cuda_ext_fp8)"
Seems like both failed to load 😢
python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_ext); print(ext.cuda_ext_fp8)"
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/quantization/tensor_quant.py:92: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
scaled_e4m3_abstract = torch.library.impl_abstract("trt::quantize_fp8")(
Loading extension modelopt_cuda_ext...
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/utils/cpp_extension.py:58: UserWarning: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
Unable to load extension modelopt_cuda_ext and falling back to CPU version.
warnings.warn(
None
Loading extension modelopt_cuda_ext_fp8...
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/site-packages/modelopt/torch/utils/cpp_extension.py:58: UserWarning: CUDA extension for FP8 quantization could not be built and loaded, FP8 simulated quantization will not be available.
warnings.warn(
None
/home/yixzhou/miniconda3/envs/fixed_keras_modelopt_trt/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprq_1onoo'>
_warnings.warn(warn_message, ResourceWarning)
It seems the CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
Can you look into this please? It seems unrelated to Model Optimizer.
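For reference, a minimal sketch of one way to do this from Python (the path below is an assumption; substitute your actual toolkit root, i.e. the directory containing bin/nvcc, or equivalently export CUDA_HOME in your shell before running Python):

```python
# Sketch: make CUDA_HOME visible before torch's extension builder is imported.
# /usr/local/cuda-12.2 is an assumed path -- substitute your actual CUDA
# install root (see `which nvcc`).
import os
os.environ.setdefault("CUDA_HOME", "/usr/local/cuda-12.2")

import modelopt.torch.quantization.extensions as ext

print(ext.cuda_ext)      # should no longer be None once the build succeeds
print(ext.cuda_ext_fp8)
```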
Ah, @kevalmorabia97, thank you so much for this pointer! After pointing CUDA_HOME to the correct location (I also had to fix a ninja issue on my Ubuntu machine), I am able to run the correct kernels and get FP8 working!
I will close this now. Thank you again for your help! ❤
I got the same error and the same output when running python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()". But I have already installed ninja and set CUDA_PATH. Can someone help me with this? Thank you in advance.
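A quick way to double-check that ninja and the CUDA toolkit are actually visible to the active Python environment is the sketch below (it uses standard torch helpers; note that torch falls back to CUDA_PATH when CUDA_HOME is unset, which matters on Windows):

```python
# Sketch: verify the build prerequisites the ModelOpt extensions need.
from torch.utils.cpp_extension import CUDA_HOME, verify_ninja_availability

verify_ninja_availability()  # raises RuntimeError if the ninja binary is not on PATH
print("ninja OK")
print("CUDA toolkit root as seen by torch:", CUDA_HOME)  # None means torch cannot find it
```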
@Nikki-Gu it seems like you are on Windows. Can you try WSL instead? PyTorch features are experimental on Windows.
That being said, can you also share the full output error? It might be a common error you can search online to get some solutions.
Yes, it's on Windows. The full error output is as follows.
D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\utils\cpp_extension.py:58: UserWarning: CUDA extension for FP8 quantization could not be built and loaded, FP8 simulated quantization will not be available.
warnings.warn(
Traceback (most recent call last):
File "D:\LLM\TensorRT-LLM-v14\examples\quantization\quantize.py", line 139, in <module>
quantize_and_export(
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\tensorrt_llm\quantization\quantize_by_modelopt.py", line 539, in quantize_and_export
model = quantize_model(model, quant_cfg, calib_dataloader, batch_size,
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\tensorrt_llm\quantization\quantize_by_modelopt.py", line 349, in quantize_model
atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\model_quant.py", line 131, in quantize
return calibrate(model, config["algorithm"], forward_loop=forward_loop)
File "modelopt\\torch\\quantization\\model_calib.py", line 102, in modelopt.torch.quantization.model_calib.calibrate
File "modelopt\\torch\\quantization\\model_calib.py", line 529, in modelopt.torch.quantization.model_calib.awq
File "modelopt\\torch\\quantization\\model_calib.py", line 531, in modelopt.torch.quantization.model_calib.awq
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "modelopt\\torch\\quantization\\model_calib.py", line 703, in modelopt.torch.quantization.model_calib.awq_lite
File "modelopt\\torch\\quantization\\model_calib.py", line 82, in modelopt.torch.quantization.model_calib.calibrate.forward_loop
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\tensorrt_llm\quantization\quantize_by_modelopt.py", line 318, in calibrate_loop
model(data)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 1167, in forward
outputs = self.model(
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 976, in forward
layer_outputs = decoder_layer(
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 702, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 580, in forward
query_states = self.q_proj(hidden_states)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "modelopt\\torch\\quantization\\model_calib.py", line 655, in modelopt.torch.quantization.model_calib.awq_lite.forward
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\quant_module.py", line 84, in forward
return super().forward(input, *args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\quant_module.py", line 42, in forward
return self.output_quantizer(output)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\tensor_quantizer.py", line 667, in forward
outputs = self._quant_forward(inputs)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\nn\modules\tensor_quantizer.py", line 440, in _quant_forward
outputs = scaled_e4m3(
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\autograd\function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\tensor_quant.py", line 381, in forward
outputs = quantize_op(
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\torch\_ops.py", line 1061, in __call__
return self_._op(*args, **(kwargs or {}))
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\tensor_quant.py", line 123, in _quantize_impl
return scaled_e4m3_impl(inputs=inputs, amax=amax)
File "D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\quantization\tensor_quant.py", line 50, in scaled_e4m3_impl
cuda_ext_fp8 is not None
AssertionError: cuda_ext_fp8 could not be imported. E4M3 quantization requires CUDA and cuda_ext_fp8.
Can you run just python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" and share the output? It should print a warning something like <torch error stack trace> Unable to load extension modelopt_cuda_ext_fp8 and falling back to CPU version.
The output is:
Loading extension modelopt_cuda_ext...
D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\utils\cpp_extension.py:58: UserWarning: Ninja is required to load C++ extensions
Unable to load extension modelopt_cuda_ext and falling back to CPU version.
warnings.warn(
None
Loading extension modelopt_cuda_ext_fp8...
D:\Games\TensorRT_LLM_Python_Environment_Win_Py31011_v0_14_0_env\lib\site-packages\modelopt\torch\utils\cpp_extension.py:58: UserWarning: CUDA extension for FP8 quantization could not be built and loaded, FP8 simulated quantization will not be available.
warnings.warn(
None
Which version of nvidia-modelopt are you using? Ideally there should be more details in the error output.
The version is 0.17.0.
Can you also try upgrading to the 0.19.0 version? What does python -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())" show?
It shows 12.4 and True. Sure, I will have a try.
Version 0.19.0 didn't work for this case
But does it show more error details?
No, still the same errors as with version 0.17.0.
I'd suggest you give it a try on Linux / WSL then
Hi modelopt developers,
I was installing modelopt from the released files on the PyPI registry (I specifically need this for my use case). In particular, I am using this version: nvidia_modelopt-0.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl. However, I noticed that when I did torch FP8 quantization, it gave me the following error:
cuda_ext_fp8 could not be imported. E4M3 quantization requires CUDA and cuda_ext_fp8.
Could you please advise on how / where I can install the extensions and make them available to the library when loading?
Thank you in advance,
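For context, a minimal sketch of the kind of call that hits this assertion (the model and calibration loop are illustrative; mtq.FP8_DEFAULT_CFG follows the ModelOpt quantization examples, and the error fires as soon as a quantized forward pass needs the FP8 kernels):

```python
# Sketch: minimal FP8 post-training quantization that exercises scaled_e4m3.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Linear(16, 16).cuda()

def forward_loop(m):
    # Tiny calibration pass; real use would iterate over a calibration dataset.
    m(torch.randn(4, 16, device="cuda"))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# When modelopt_cuda_ext_fp8 failed to build, the first fake-quantized forward
# raises: "cuda_ext_fp8 could not be imported. E4M3 quantization requires ..."
model(torch.randn(4, 16, device="cuda"))
```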