TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
I have already exported InternViT (6B, pretrained, https://huggingface.co/OpenGVLab/InternViT-6B-224px) to ONNX with torch.onnx.export, and obtained one .onnx file plus many other external weight files like this:
They occupy 22 GB altogether. The ONNX model passes onnx.checker and can also be opened in Netron.
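Roughly, the export looked like the following. This is only a minimal sketch: the loading call, input shape, opset, and file names are placeholders, not the exact script.

# Minimal sketch of the export step; loading call, dtype, shapes, opset and
# file names are placeholders rather than the exact script.
import torch
import onnx
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-6B-224px", trust_remote_code=True
).eval()
dummy = torch.randn(1, 3, 224, 224)  # 224 px input resolution per the model card

# A model this large (>2 GB) makes torch.onnx.export write the weights into
# external data files next to the .onnx file, which is why there is one .onnx
# plus many weight files.
torch.onnx.export(
    model,
    dummy,
    "intern_vit_6b.onnx",
    opset_version=17,
    input_names=["pixel_values"],
)

# Passing the path (rather than a loaded ModelProto) lets the checker resolve
# the external weight files of a >2 GB model.
onnx.checker.check_model("intern_vit_6b.onnx")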
Then this error occurred when I tried to quantize the InternViT ONNX model, while the onnx_ptq/ example runs normally:
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/__main__.py", line 133, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/__main__.py", line 115, in main
    quantize(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/quantize.py", line 207, in quantize
    onnx_model = quantize_func(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/int8.py", line 186, in quantize
    quantize_static(
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/quantize.py", line 505, in quantize_static
    calibrator = create_calibrator(
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/calibrate.py", line 1155, in create_calibrator
    calibrator.create_inference_session()
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/ort_patching.py", line 194, in _create_inference_session
    calibrator.infer_session = ort.InferenceSession(
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 472, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /tmp/ort.quant.ofeo9ljp/augmented_model.onnx failed:Protobuf parsing failed.
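The run that produces this traceback follows the onnx_ptq/ example. In Python terms it corresponds roughly to the quantize entry point visible in the stack above; the keyword names below are only my approximation of that API, and the paths and calibration file are placeholders.

# Rough Python equivalent of the CLI run (python -m modelopt.onnx.quantization ...);
# keyword names are approximations of the modelopt quantize API, paths are placeholders.
import numpy as np
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="intern_vit_6b.onnx",          # the 22 GB export described above
    quantize_mode="int8",
    calibration_data=np.load("calib.npy"),   # calibration array in the onnx_ptq/ style
    output_path="intern_vit_6b.quant.onnx",
)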
I use the NGC container (PyTorch 24.07) to build the Docker image, and here is my onnx version:
Also, the command I use is exactly the same as in the onnx_ptq/ example (as sketched above). Is this caused by the large size of the ONNX model? How can I fix it?
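For what it's worth, a quick check like the one below (a sketch; the path is a placeholder) should show whether the exported graph keeps its weights in external data files, i.e. whether the model is well past the 2 GB protobuf limit:

# Sketch: count initializers stored as external data (what torch.onnx.export
# does for models that exceed the 2 GB protobuf limit). Path is a placeholder.
import onnx
from onnx.external_data_helper import uses_external_data

# load_external_data=False reads only the graph structure, not the 22 GB of weights.
model = onnx.load("intern_vit_6b.onnx", load_external_data=False)
external = [init.name for init in model.graph.initializer if uses_external_data(init)]
print(f"{len(external)} of {len(model.graph.initializer)} initializers use external data files")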