import os
import onnx
import onnx.version_converter
from onnxsim import simplify
from onnxruntime.quantization import QuantType, quantize_dynamic
import subprocess  # Needed for the ORT-format conversion and the optional pre-processing commands below.
# Path Setting
original_folder_path = r"C:\Users\Downloads\Model_ONNX" # The original folder.
quanted_folder_path = r"C:\Users\Downloads\Model_ONNX_Quanted" # The folder where the quantized model will be stored.
model_path = os.path.join(original_folder_path, "Model.onnx") # The original fp32 model path.
quanted_model_path = os.path.join(quanted_folder_path, "Model_quanted.onnx") # The path where the quantized model is stored.
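# Optional safeguard (an assumption, not shown in the original snippet): create the
# output folder if it does not exist yet, so quantize_dynamic can write Model_quanted.onnx.
os.makedirs(quanted_folder_path, exist_ok=True)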
# Start Quantization
quantize_dynamic(
    model_input=model_path,
    model_output=quanted_model_path,
    per_channel=True,                 # True improves accuracy but makes the quantization process much slower.
    reduce_range=False,               # True for some x86_64 platforms.
    weight_type=QuantType.QInt8,      # Int8 is officially recommended; no obvious difference between Int8 and UInt8.
    extra_options={'ActivationSymmetric': True,          # True favors inference speed; False may preserve more accuracy.
                   'WeightSymmetric': True,              # True favors inference speed; False may preserve more accuracy.
                   'EnableSubgraph': True,               # True quantizes more of the graph.
                   'ForceQuantizeNoInputCheck': False,   # True quantizes more of the graph.
                   'MatMulConstBOnly': False             # False quantizes more of the graph; sometimes inference speed gets worse.
                   },
    nodes_to_exclude=None,            # Node names to exclude from quantization. Example: nodes_to_exclude=['/Gather']
    use_external_data_format=False    # True saves the model in two parts (graph plus external weight data).
)
# ONNX Model Optimizer
model, _ = simplify(
    model=onnx.load(quanted_model_path),
    include_subgraph=True,
    dynamic_input_shape=False,        # True for dynamic input shapes.
    tensor_size_threshold="1.99GB",   # Must be less than 2GB.
    perform_optimization=True,        # True for more optimization.
    skip_fuse_bn=False,               # False for more optimization.
    skip_constant_folding=False,      # False for more optimization.
    skip_shape_inference=False,       # False for more optimization.
    mutable_initializer=False         # False keeps the initializers static.
)
onnx.save(model, quanted_model_path)
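# Optional sanity check (an assumption, not part of the original steps): validate the
# simplified, quantized model before converting it to ORT format.
onnx.checker.check_model(quanted_model_path)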
# Convert the quantized ONNX model to ORT format. Set optimization_style and target_platform before running.
subprocess.run([f'python -m onnxruntime.tools.convert_onnx_models_to_ort --output_dir {quanted_folder_path} --optimization_style {optimization_style} --target_platform {target_platform} --enable_type_reduction {quanted_folder_path}'], shell=True)
Note: the pre-processing code is not required:
subprocess.run([f'python -m onnxruntime.quantization.preprocess --auto_merge --all_tensors_to_one_file --input {model_path} --output {quant_model_path}'], shell=True)
Step 10 (exporting the model) ran successfully and then produced a bunch of files.
- Those files are the ONNX model; it is split into parts because ONNX only supports files up to 2GB. You can find 'Qwen.onnx' in the export folder and use 'Netron' to visualize the model.
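Despite the split, the parts load back as one model. A minimal sketch (the file name 'Qwen.onnx' comes from the reply above; the folder path is a placeholder you must adjust):

import onnx

# onnx.load resolves the external-data files automatically, as long as they sit
# in the same folder as Qwen.onnx.
model = onnx.load("path/to/export_folder/Qwen.onnx")  # replace with your export folder
print(f"Graph contains {len(model.graph.node)} nodes")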
- Qwen_Export.py is the tutorial.
Float32 ONNX model -> quantization -> ORT format
- Do_Quantize is the tutorial; a condensed sketch of the pipeline follows below. Which step is not clear enough? We need more feedback to refine the details :)
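A condensed sketch of the Float32 ONNX -> quantization -> ORT pipeline, under the same assumptions as the script above; the paths, file names, and the --optimization_style value ('Fixed' or 'Runtime') are placeholders you must adapt:

import os, subprocess
import onnx
from onnxsim import simplify
from onnxruntime.quantization import QuantType, quantize_dynamic

fp32_path = "path/to/Model.onnx"              # placeholder: your exported float32 model
out_dir = "path/to/Model_ONNX_Quanted"        # placeholder: output folder
quant_path = os.path.join(out_dir, "Model_quanted.onnx")
os.makedirs(out_dir, exist_ok=True)

# 1. Float32 ONNX -> dynamically quantized Int8 ONNX
quantize_dynamic(model_input=fp32_path, model_output=quant_path,
                 per_channel=True, weight_type=QuantType.QInt8)

# 2. Simplify / optimize the quantized graph
model, _ = simplify(onnx.load(quant_path))
onnx.save(model, quant_path)

# 3. Quantized ONNX -> ORT format ('Fixed' or 'Runtime' for --optimization_style)
subprocess.run(f'python -m onnxruntime.tools.convert_onnx_models_to_ort '
               f'--output_dir {out_dir} --optimization_style Fixed --enable_type_reduction {out_dir}',
               shell=True, check=True)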
Hi, I followed the docs and ran step 10 to export the model; it succeeded and produced a bunch of files. The original model is qwen2-1.5B-Instruct downloaded from ModelScope. Then, per step 12, "the model quantization method can be found in the 'Do_Quantize' folder." I changed the paths to my own, but I keep getting all kinds of errors. Could you provide a more detailed tutorial for converting the original model to an ONNX model, or the original model to an ORT-format model? I am on a Mac.