NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, sparsity, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

ONNX quantization fails when encountering Softmax #19

Closed tp-nan closed 3 months ago

tp-nan commented 3 months ago

Hi, when quantizing VILA 1.5 from FP32 ONNX to INT8 ONNX with:

python -m modelopt.onnx.quantization --quantize_mode int8 --verbose --onnx_path onnx/visual_encoder_fp32.onnx --calibration_data ./prefill/calib_imgs/results.npy 

Failed:

RuntimeError: Only an existing tensor can be modified, '/vision_tower/vision_tower/vision_model/encoder/layers.0/self_attn/Softmax_output_0' is not.

The offending code in onnxruntime:

            # Adjust Softmax to range from 0.0 to 1.0
            elif node.op_type == "Softmax":
                self.tensors_range[node.output[0]] = TensorData(lowest=np.float32(0.0), highest=np.float32(1.0))
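(For context: adjust_tensor_ranges pins every Softmax output to [0.0, 1.0], but the tensors_range container only lets existing entries be overwritten, so it raises when a Softmax output was never recorded during calibration. A hypothetical local patch, purely a sketch and not the official fix, would guard the assignment:)

            # Hypothetical workaround (a sketch, not the upstream fix):
            # only adjust Softmax outputs that were actually recorded
            # during calibration, assuming tensors_range supports
            # membership tests.
            elif node.op_type == "Softmax":
                if node.output[0] in self.tensors_range:
                    self.tensors_range[node.output[0]] = TensorData(
                        lowest=np.float32(0.0), highest=np.float32(1.0)
                    )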

How can I fix this?

riyadshairi979 commented 3 months ago

Can you try --op_types_to_exclude=Softmax in the command line? I suspect this is a side effect of shared-input quantization. Also, can you provide the ONNX model and the full stack trace?
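For reference, applied to your original command that would be (same paths assumed):

python -m modelopt.onnx.quantization --quantize_mode int8 --verbose --onnx_path onnx/visual_encoder_fp32.onnx --calibration_data ./prefill/calib_imgs/results.npy --op_types_to_exclude=Softmax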

tp-nan commented 3 months ago

Hi,

> Can you try --op_types_to_exclude=Softmax

That did not work.

> Also, can you provide the ONNX model and the full stack trace?

Here is the full stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/__main__.py", line 138, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/__main__.py", line 121, in main
    quantize(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/quantize.py", line 447, in quantize
    quantize_static(
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/quantize.py", line 539, in quantize_static
    quantizer = QDQQuantizer(
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/qdq_quantizer.py", line 207, in __init__
    self.quantization_params = self.calc_graph_quant_params()
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/qdq_quantizer.py", line 1156, in calc_graph_quant_params
    self.adjust_tensor_ranges()
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/base_quantizer.py", line 504, in adjust_tensor_ranges
    self.tensors_range[node.output[0]] = TensorData(lowest=np.float32(0.0), highest=np.float32(1.0))
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/calibrate.py", line 127, in __setitem__
    raise RuntimeError(f"Only an existing tensor can be modified, {key!r} is not.")
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/calibrate.py", line 127, in __setitem__
    raise RuntimeError(f"Only an existing tensor can be modified, {key!r} is not.")

The FP16 ONNX model (SigLIP for VILA1.5-3B) has been emailed to you; the FP32 ONNX is too big. You can also get it from https://github.com/Efficient-Large-Model/VILA/blob/44a4cca98ac0f81b0891eb2341e9826b5553b6e8/demo_trt_llm/build_visual_engine.py#L95

The following command may also reproduce the issue in #18:

python -m modelopt.onnx.quantization --quantize_mode int8 --verbose --onnx_path visual_encoder.onnx

This may also be related to onnxruntime.

riyadshairi979 commented 3 months ago

Please upgrade to modelopt 0.13. This issue has been fixed there.
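For reference, the upgrade should be as simple as the following, assuming the package is published on PyPI as nvidia-modelopt:

pip install --upgrade nvidia-modelopt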

tp-nan commented 3 months ago

awesome!