DakeQQ / Native-LLM-for-Android

Demonstration of running a native LLM on Android device.
https://dakeqq.github.io/overview/
Apache License 2.0

ONNX model conversion issue #5

Closed: zhb-code closed this issue 1 month ago

zhb-code commented 1 month ago

Hi, I followed the docs and ran step 10 to export the model. It succeeded and produced a bunch of files. The original model is qwen2-1.5B-Instruct downloaded from ModelScope. Then, per step 12, "the model quantization method can be found in the 'Do_Quantize' folder." I changed the paths to my own, but I keep hitting all kinds of errors. Could you put together a more detailed tutorial for converting the original model to an ONNX model, or the original model to the ORT format? I'm on a Mac.

DakeQQ commented 1 month ago
  1. Using q8_f32.py, first try disabling all of the "subprocess.run..." code and keep only the following:
import os
import onnx
import onnx.version_converter
from onnxsim import simplify
from onnxruntime.quantization import QuantType, quantize_dynamic

# Path Setting
original_folder_path = r"C:\Users\Downloads\Model_ONNX"                          # The folder containing the original model.
quanted_folder_path = r"C:\Users\Downloads\Model_ONNX_Quanted"                   # The folder that will hold the quantized model.
model_path = os.path.join(original_folder_path, "Model.onnx")                    # The original fp32 model path.
quanted_model_path = os.path.join(quanted_folder_path, "Model_quanted.onnx")     # The path where the quantized model will be stored.

2. # Start Quantize

quantize_dynamic(
    model_input=model_path,
    model_output=quanted_model_path,
    per_channel=True,                                        # True improves accuracy but costs a lot of time during the quantization process.
    reduce_range=False,                                      # Set True on some x86_64 platforms.
    weight_type=QuantType.QInt8,                             # Int8 is officially recommended; no obvious difference between the Int8 and UInt8 formats.
    extra_options={'ActivationSymmetric': True,              # True favors inference speed; False may keep more accuracy.
                   'WeightSymmetric': True,                  # True favors inference speed; False may keep more accuracy.
                   'EnableSubgraph': True,                   # True quantizes more nodes.
                   'ForceQuantizeNoInputCheck': False,       # True quantizes more nodes.
                   'MatMulConstBOnly': False                 # False quantizes more nodes; sometimes the inference speed may get worse.
                   },
    nodes_to_exclude=None,                                   # Node names to exclude from quantization. Example: nodes_to_exclude=['/Gather']
    use_external_data_format=False                           # Set True to save the model in two parts (needed for models larger than 2GB).
)

3. # ONNX Model Optimizer
model, _ = simplify(
    model=onnx.load(quanted_model_path),
    include_subgraph=True,
    dynamic_input_shape=False,          # True for dynamic input.
    tensor_size_threshold="1.99GB",        # Must be less than 2GB.
    perform_optimization=True,          # True for more optimize.
    skip_fuse_bn=False,                 # False for more optimize.
    skip_constant_folding=False,        # False for more optimize.
    skip_shape_inference=False,         # False for more optimize.
    mutable_initializer=False           # False for static initializer.
)
onnx.save(model, quanted_model_path)
  4. After quantization, open a terminal and run the command below manually to convert to the ORT format (not required; model.onnx can also run successfully on Android). A filled-in example follows the note at the end of this reply:

python -m onnxruntime.tools.convert_onnx_models_to_ort --output_dir {quanted_folder_path} --optimization_style {optimization_style} --target_platform {target_platform} --enable_type_reduction {quanted_folder_path}

  5. If you can, please paste more of the error messages :)

Note: the pre-process code is not required: subprocess.run([f'python -m onnxruntime.quantization.preprocess --auto_merge --all_tensors_to_one_file --input {model_path} --output {quant_model_path}'], shell=True)
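
For reference, a filled-in version of the step 4 command might look like the following. This is only a sketch: 'Fixed' and 'arm' are assumed values for the optimization style and target platform, and the folder path is the one used in the quantization script above.

python -m onnxruntime.tools.convert_onnx_models_to_ort --output_dir "C:\Users\Downloads\Model_ONNX_Quanted" --optimization_style Fixed --target_platform arm --enable_type_reduction "C:\Users\Downloads\Model_ONNX_Quanted"

The tool should write a .ort file next to each .onnx it finds in that folder, along with an operator config file that can be used for a minimal ONNX Runtime build.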

DakeQQ commented 1 month ago

Tutorial for converting the original PyTorch model to an ONNX model

"I ran step 10 to export the model, it succeeded, and it produced a bunch of files"

  1. Those files are the ONNX model; it is split up because ONNX only supports files up to 2GB. You can find 'Qwen.onnx' in the export folder and use 'Netron' to visualize the model (see the short loading sketch after this list).
  2. Qwen_Export.py itself is the tutorial.

    "Float32 ONNX model -> quantization -> ORT format"

  3. Do_Quantize itself is the tutorial. Which step is unclear? We need more feedback to refine the details :)
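
To confirm that the split export from step 10 is intact before quantizing, a quick check like the sketch below can help. This is only an illustration; the folder path is a placeholder to replace with your own export location.

import os
import onnx

export_folder = "/path/to/Model_ONNX"                   # Placeholder: the folder produced by the export in step 10.
model_path = os.path.join(export_folder, "Qwen.onnx")

# onnx.load picks up the external-data files stored next to Qwen.onnx,
# so keep all of the exported files together in the same folder.
model = onnx.load(model_path)
print("Inputs: ", [i.name for i in model.graph.input])
print("Outputs:", [o.name for o in model.graph.output])

# For models larger than 2GB, pass the file path (not the loaded proto) to the checker.
onnx.checker.check_model(model_path)

If this loads and checks without errors, the export itself is fine and any remaining problems are in the quantization step.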