PaddlePaddle / Paddle-Lite

PaddlePaddle High-Performance Deep Learning Inference Engine for Mobile and Edge
https://www.paddlepaddle.org.cn/lite
Apache License 2.0

Could you provide performance data for mainstream Android CPUs? Thanks #5

Closed nkyle04 closed 6 years ago

nkyle04 commented 6 years ago

So far I only see that SqueezeNet runs in 30 ms on the iOS GPU. Could you provide Android numbers as well, so this framework can be compared against others? From the code, ncnn implements its convolutions directly with NEON instructions, which seems like it should be faster than going through GEMM as is done here.
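
For context on the comparison: "using GEMM" here normally means an im2col step that flattens every receptive field into a column, after which convolution becomes a single matrix multiply between the flattened filters and that patch matrix. A rough single-channel, stride-1, no-padding sketch (my own illustration, not code from either framework):

    #include <vector>

    // Flatten each k x k input patch into one column of a (k*k) x (outH*outW)
    // row-major matrix, so convolution reduces to filters * patches.
    std::vector<float> im2col(const float *input, int H, int W, int k) {
        int outH = H - k + 1, outW = W - k + 1;
        std::vector<float> cols(static_cast<size_t>(k) * k * outH * outW);
        for (int ky = 0; ky < k; ++ky)
            for (int kx = 0; kx < k; ++kx)
                for (int oy = 0; oy < outH; ++oy)
                    for (int ox = 0; ox < outW; ++ox)
                        cols[((ky * k + kx) * outH + oy) * outW + ox] =
                            input[(oy + ky) * W + (ox + kx)];
        return cols;
    }

A direct convolution like ncnn's skips this extra copy, which is part of why it can be faster for small kernels.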

cocodark commented 6 years ago

We do use NEON instructions for the matrix arithmetic, inside the GEMM itself. This is the micro-kernel:

    void Gemmer::dgemm_micro_kernel(int kc, float alpha, const float *A, const float *B,
                                    float beta, float *C, int incRowC, int incColC) {
    #ifndef MDL_MAC
        int i, j, l;

        // Four accumulators, one per 4-element column of the 4x4 micro-tile.
        float32x4_t abv0 = vdupq_n_f32(0);
        float32x4_t abv1 = vdupq_n_f32(0);
        float32x4_t abv2 = vdupq_n_f32(0);
        float32x4_t abv3 = vdupq_n_f32(0);

        float32x4_t av;
        float32x4_t bv;

        float32x2_t bv01;
        float32x2_t bv23;

        for (l = 0; l < kc; ++l) {
            av = vld1q_f32(A);   // 4 elements of the packed A panel
            bv = vld1q_f32(B);   // 4 elements of the packed B panel
            bv01 = vget_low_f32(bv);
            abv0 = vmlaq_lane_f32(abv0, av, bv01, 0);   // abv0 += av * B[0]
            abv1 = vmlaq_lane_f32(abv1, av, bv01, 1);   // abv1 += av * B[1]
            bv23 = vget_high_f32(bv);
            abv2 = vmlaq_lane_f32(abv2, av, bv23, 0);   // abv2 += av * B[2]
            abv3 = vmlaq_lane_f32(abv3, av, bv23, 1);   // abv3 += av * B[3]
            A += MR;   // MR = micro-tile rows
            B += NR;   // NR = micro-tile columns
        }

        vst1q_f32(AB_ + 0, abv0);    // store the 4x4 accumulator tile
        vst1q_f32(AB_ + 4, abv1);
        vst1q_f32(AB_ + 8, abv2);
        vst1q_f32(AB_ + 12, abv3);

        // ... (the rest of the kernel, which applies alpha/beta and scatters
        // AB_ into C via incRowC/incColC, is omitted here)
    #endif
    }
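
For readers not fluent in NEON intrinsics: each loop iteration above accumulates the outer product of four elements of A with four elements of B into a 4x4 tile. A scalar sketch of the same computation (my own illustration, assuming MR == NR == 4 as the store pattern suggests):

    // Scalar equivalent of the NEON loop: one rank-1 update of the 4x4
    // accumulator AB per step along the k dimension.
    void micro_kernel_scalar(int kc, const float *A, const float *B, float AB[16]) {
        for (int i = 0; i < 16; ++i) AB[i] = 0.0f;
        for (int l = 0; l < kc; ++l) {
            for (int j = 0; j < 4; ++j)        // lane of bv
                for (int i = 0; i < 4; ++i)    // lane of av
                    AB[j * 4 + i] += A[i] * B[j];
            A += 4;  // MR
            B += 4;  // NR
        }
    }

Replacing the 16 scalar multiply-accumulates per step with four fused vmlaq_lane_f32 instructions is where the speedup over a naive loop comes from.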

On a Xiaomi Mi 6, our numbers are:

googlenet: 360 ms average
squeezenet: 98 ms average
mobilenet: 360 ms average

There are too many Android models for us to cover them all, so please take these as a reference. Thanks!
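
If you want to reproduce averages like these on another device, a minimal timing harness would look like the following (my own sketch; run_inference is a hypothetical stand-in for one forward pass of the model under test):

    #include <chrono>
    #include <cstdio>

    // Time `iters` forward passes after a warm-up phase, so frequency
    // scaling and cold caches do not skew the reported mean latency.
    void benchmark(void (*run_inference)(), int warmup, int iters) {
        for (int i = 0; i < warmup; ++i) run_inference();
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) run_inference();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("mean latency: %.2f ms over %d runs\n", ms / iters, iters);
    }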

nkyle04 commented 6 years ago

Thanks for sharing. One more question: how large is the input image?

allonli commented 6 years ago

224×224