import os
import onnx
import onnx.version_converter
from onnxsim import simplify
from onnxruntime.quantization import QuantType, quantize_dynamic
import subprocess  # Needed for the ORT-format conversion and the optional pre-processing commands below.
# Path Setting
original_folder_path = r"C:\Users\Downloads\Model_ONNX" # The original folder.
quanted_folder_path = r"C:\Users\Downloads\Model_ONNX_Quanted" # The folder where the quantized model will be stored.
model_path = os.path.join(original_folder_path, "Model.onnx") # The original fp32 model path.
quanted_model_path = os.path.join(quanted_folder_path, "Model_quanted.onnx") # The path where the quantized model is stored.
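# Optional safeguard (an assumption, not shown in the original snippet): create the
# output folder if it does not exist yet, so quantize_dynamic can write Model_quanted.onnx.
os.makedirs(quanted_folder_path, exist_ok=True)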
# Start Quantization
quantize_dynamic(
    model_input=model_path,
    model_output=quanted_model_path,
    per_channel=True,                 # True improves accuracy but makes the quantization process much slower.
    reduce_range=False,               # True for some x86_64 platforms.
    weight_type=QuantType.QInt8,      # Int8 is officially recommended; no obvious difference between Int8 and UInt8.
    extra_options={'ActivationSymmetric': True,          # True favors inference speed; False may preserve more accuracy.
                   'WeightSymmetric': True,              # True favors inference speed; False may preserve more accuracy.
                   'EnableSubgraph': True,               # True quantizes more of the graph.
                   'ForceQuantizeNoInputCheck': False,   # True quantizes more of the graph.
                   'MatMulConstBOnly': False             # False quantizes more of the graph; sometimes inference speed gets worse.
                   },
    nodes_to_exclude=None,            # Node names to exclude from quantization. Example: nodes_to_exclude=['/Gather']
    use_external_data_format=False    # True saves the model in two parts (graph plus external weight data).
)
# ONNX Model Optimizer
model, _ = simplify(
    model=onnx.load(quanted_model_path),
    include_subgraph=True,
    dynamic_input_shape=False,        # True for dynamic input shapes.
    tensor_size_threshold="1.99GB",   # Must be less than 2GB.
    perform_optimization=True,        # True for more optimization.
    skip_fuse_bn=False,               # False for more optimization.
    skip_constant_folding=False,      # False for more optimization.
    skip_shape_inference=False,       # False for more optimization.
    mutable_initializer=False         # False keeps the initializers static.
)
onnx.save(model, quanted_model_path)
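# Optional sanity check (an assumption, not part of the original steps): validate the
# simplified, quantized model before converting it to ORT format.
onnx.checker.check_model(quanted_model_path)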
# Convert the quantized ONNX model to ORT format. Set optimization_style and target_platform before running.
subprocess.run([f'python -m onnxruntime.tools.convert_onnx_models_to_ort --output_dir {quanted_folder_path} --optimization_style {optimization_style} --target_platform {target_platform} --enable_type_reduction {quanted_folder_path}'], shell=True)
Note: the pre-processing code is not required:
subprocess.run([f'python -m onnxruntime.quantization.preprocess --auto_merge --all_tensors_to_one_file --input {model_path} --output {quant_model_path}'], shell=True)
Step 10 (exporting the model) ran successfully and then produced a bunch of files.
- Those files are the ONNX model; it is split into parts because ONNX only supports files up to 2GB. You can find 'Qwen.onnx' in the export folder and use 'Netron' to visualize the model.
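Despite the split, the parts load back as one model. A minimal sketch (the file name 'Qwen.onnx' comes from the reply above; the folder path is a placeholder you must adjust):

import onnx

# onnx.load resolves the external-data files automatically, as long as they sit
# in the same folder as Qwen.onnx.
model = onnx.load("path/to/export_folder/Qwen.onnx")  # replace with your export folder
print(f"Graph contains {len(model.graph.node)} nodes")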
- Qwen_Export.py is the tutorial.
Float32 ONNX model -> quantization -> ORT format
- Do_Quantize is the tutorial; a condensed sketch of the pipeline follows below. Which step is not clear enough? We need more feedback to refine the details :)
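A condensed sketch of the Float32 ONNX -> quantization -> ORT pipeline, under the same assumptions as the script above; the paths, file names, and the --optimization_style value ('Fixed' or 'Runtime') are placeholders you must adapt:

import os, subprocess
import onnx
from onnxsim import simplify
from onnxruntime.quantization import QuantType, quantize_dynamic

fp32_path = "path/to/Model.onnx"              # placeholder: your exported float32 model
out_dir = "path/to/Model_ONNX_Quanted"        # placeholder: output folder
quant_path = os.path.join(out_dir, "Model_quanted.onnx")
os.makedirs(out_dir, exist_ok=True)

# 1. Float32 ONNX -> dynamically quantized Int8 ONNX
quantize_dynamic(model_input=fp32_path, model_output=quant_path,
                 per_channel=True, weight_type=QuantType.QInt8)

# 2. Simplify / optimize the quantized graph
model, _ = simplify(onnx.load(quant_path))
onnx.save(model, quant_path)

# 3. Quantized ONNX -> ORT format ('Fixed' or 'Runtime' for --optimization_style)
subprocess.run(f'python -m onnxruntime.tools.convert_onnx_models_to_ort '
               f'--output_dir {out_dir} --optimization_style Fixed --enable_type_reduction {out_dir}',
               shell=True, check=True)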
Hi, I followed the docs and ran step 10 to export the model; it succeeded and produced a bunch of files. The original model is qwen2-1.5B-Instruct downloaded from ModelScope. Then, per step 12, "the model quantization method can be found in the 'Do_Quantize' folder." I changed the paths to my own, but I keep getting all kinds of errors. Could you provide a more detailed tutorial for converting the original model to an ONNX model, or the original model to an ORT-format model? I am on a Mac.