airockchip / rknn-toolkit2


Are there plans to support larger Transpose operators, or is there any way to optimize this? #88

Open kaylorchen opened 3 months ago

kaylorchen commented 3 months ago

Transpose will fallback to CPU, because input shape has exceeded the max limit, height(64) * width(1376) = 88064, required product no larger than 16384!

While testing depth_anything, I found that the attention blocks need all kinds of matrix transposes, but the dimensions are too large for the NPU to handle, so inference speed suffers badly. Is there any way to solve this? [image]

yuyun2000 commented 3 months ago

Yes, this one is hard to deal with, and it also makes the exported rknn model very large. Even if you split the transpose up manually, rknn-toolkit seems to fuse the pieces back together automatically...

kaylorchen commented 3 months ago

> Yes, this one is hard to deal with, and it also makes the exported rknn model very large. Even if you split the transpose up manually, rknn-toolkit seems to fuse the pieces back together automatically...

Indeed. I also tried splitting it into sub-matrices, but the optimizer merged them back.
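(For context, here is a minimal sketch of the kind of manual split being discussed, written in plain numpy just to illustrate the shapes involved; it is not code from the model, and as noted above, doing the same thing inside the ONNX graph tends to get undone by the toolkit's fusion passes.)

```python
# Illustrative only: split one large (64, 1376) transpose into column tiles so that
# each tile's height * width stays within the NPU's 16384-element limit.
import numpy as np

def tiled_transpose(x, max_elems=16384):
    """Transpose the last two dims of x by processing the width in chunks."""
    h, w = x.shape[-2], x.shape[-1]
    cols_per_tile = max(1, max_elems // h)   # ensures h * cols_per_tile <= max_elems
    tiles = [np.swapaxes(x[..., i:i + cols_per_tile], -1, -2)
             for i in range(0, w, cols_per_tile)]
    return np.concatenate(tiles, axis=-2)    # reassemble along the new row axis

x = np.random.rand(1, 64, 1376).astype(np.float32)
assert np.array_equal(tiled_transpose(x), np.swapaxes(x, -1, -2))
```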

Vincent630 commented 1 month ago

Hi, may I ask which version of the rknn wheel you used for the conversion? I used rknn_toolkit2-2.1.0+708089d1-cp311-cp311-linux_x86_64.whl, the latest Python 3.11 wheel, but the conversion did not succeed. Any pointers would be much appreciated, thanks.

Vincent630 commented 1 month ago

> Transpose will fallback to CPU, because input shape has exceeded the max limit, height(64) * width(1376) = 88064, required product no larger than 16384! While testing depth_anything, I found that the attention blocks need all kinds of matrix transposes, but the dimensions are too large for the NPU to handle, so inference speed suffers badly. Is there any way to solve this?

May I ask whether you deployed Depth Anything v1 or v2 on the RK board?

yuyun2000 commented 1 month ago

v1, probably. I got v1 working; you can just use the ready-made ONNX.

kaylorchen commented 1 month ago

> May I ask whether you deployed Depth Anything v1 or v2 on the RK board?

v2

happyme531 commented 1 month ago

> Yes, this one is hard to deal with, and it also makes the exported rknn model very large. Even if you split the transpose up manually, rknn-toolkit seems to fuse the pieces back together automatically...
>
> Indeed. I also tried splitting it into sub-matrices, but the optimizer merged them back.

You can call the rknn model compiler directly and skip the optimization:

```python
# Internal rknn-toolkit2 entry points (not the documented RKNN class API)
from rknn.api.rknn_compiler import RKNNCompiler, RKNNConfig, RKNNNormalize
from rknn.api.rknn_platform import support_soc_npu_target
from pprint import pprint

# Print the NPU targets supported by this toolkit build
pprint(support_soc_npu_target)

# Intermediate ONNX dumped during a normal build (after the fuse_ops stage)
onnx_model_path = "check3_fuse_ops.onnx"

RKNNCompiler.build(
    onnx_model_path,
    RKNNConfig(
        target="v2",
        request_type="float16",
        optimize_options="compress=0, conv_eltwise_activation_fuse=1, global_fuse=1, multi-core-model-mode=7, output_optimize=1, layout_match=1, enable_argb_group=1, pipeline_fuse=0, enable_flash_attention=0",
        verbose_level=999,
    ),
    RKNNNormalize(
        channel_means=[[0, 0, 0]],
        channel_stds=[[1, 1, 1]],
        channel_orders=[[0, 1, 2]],  # not tested !!!
    ),
    "out.rknn",  # output model path
)
```
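(Note: RKNNCompiler, RKNNConfig and RKNNNormalize come from rknn.api internals rather than the documented RKNN class, so the argument names and the optimize_options string above are tied to the specific toolkit version and may change between releases.)
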
kaylorchen commented 1 month ago

> Yes, this one is hard to deal with, and it also makes the exported rknn model very large. Even if you split the transpose up manually, rknn-toolkit seems to fuse the pieces back together automatically...
>
> Indeed. I also tried splitting it into sub-matrices, but the optimizer merged them back.
>
> You can call the rknn model compiler directly and skip the optimization:
>
> *(conversion script from the previous comment omitted)*

My feeling is that an approach like this might fix one thing only to break another.

happyme531 commented 1 month ago

> Yes, this one is hard to deal with, and it also makes the exported rknn model very large. Even if you split the transpose up manually, rknn-toolkit seems to fuse the pieces back together automatically...
>
> Indeed. I also tried splitting it into sub-matrices, but the optimizer merged them back.
>
> You can call the rknn model compiler directly and skip the optimization:
>
> *(conversion script from the earlier comment omitted)*
>
> My feeling is that an approach like this might fix one thing only to break another.

Just do a normal build with the regular API first, take the final intermediate result check3_fuse_ops.onnx, modify it, and then compile it with the compiler script above.
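(For reference, a minimal sketch of what such a "normal build" with the public rknn-toolkit2 API looks like; the model path, mean/std values and target platform below are placeholders rather than values from this thread, and exactly where the intermediate check*.onnx files get written depends on the toolkit version and its debug settings.)

```python
# Hypothetical sketch of a normal conversion with the public RKNN API;
# file names and target platform are placeholders.
from rknn.api import RKNN

rknn = RKNN(verbose=True)
rknn.config(
    mean_values=[[0, 0, 0]],
    std_values=[[1, 1, 1]],
    target_platform="rk3588",              # placeholder target
)
rknn.load_onnx(model="depth_anything.onnx")  # placeholder model path
rknn.build(do_quantization=False)            # dumps the intermediate check*.onnx models
rknn.export_rknn("out_normal.rknn")
rknn.release()
```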

That said, I have now also figured out how to customize the optimization passes. I have not yet worked out how to run correct_ops() / fold_constant(), so for now this script needs the check2 ONNX model as input.

```python
# Internal rknn-toolkit2 entry points for driving the graph optimizer manually
from rknn.api.rknn_log import RKNNLog
from rknn.api.ir_graph import IRGraph
from rknn.api.graph_optimizer import GraphOptimizer, convert_rules, fuse_rules, hardware_rules
import onnx
import numpy
import pprint
from os import environ

# Uncomment to inspect the available optimization rules
# pprint.pprint(convert_rules)
# pprint.pprint(fuse_rules)
# pprint.pprint(hardware_rules)

# Collect the names of all optimization passes
all_passes = []
for key in convert_rules:
    all_passes.append(key)
for key in fuse_rules:
    all_passes.append(key)
for key in hardware_rules:
    all_passes.append(key)

# Passes to disable (here: the SDPA fusions)
disable_passes = []
disable_passes.append("fuse_matmul_softmax_matmul_to_sdpa")
disable_passes.append("fuse_exmatmul_add_mul_exsoftmax13_exmatmul_to_sdpa")

enabled_passes = []
for pass_name in all_passes:
    if pass_name not in disable_passes:
        enabled_passes.append(pass_name)

logger = RKNNLog()
logger.reset_log_level_and_file_path('DEBUG', '/tmp/rknn.log')

# Intermediate ONNX dumped after the correct_ops stage of a normal build
onnx_model_path = "check2_correct_ops.onnx"
onnx_model = onnx.load(onnx_model_path)

ir = IRGraph(onnx_model, verbose=True, npu_target='v2')

optimizer = GraphOptimizer(ir, None, None, None)
# optimizer.correct_ops()
# optimizer.fold_constant()
optimizer.fuse_ops(passes_print=True, passes=enabled_passes)
# print(ir.calc_params_tflops())
```

(Edit: added syntax highlighting.)