nkyle04 commented 6 years ago

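So far I have only seen that SqueezeNet runs in about 30 ms on the iOS GPU. Could you provide performance numbers on Android, so that they can be compared with other frameworks? Judging from the code, ncnn implements convolution directly with NEON instructions, which I would expect to be somewhat faster than going through GEMM as is done here.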
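
For readers unfamiliar with the GEMM approach mentioned above, here is a minimal portable sketch of convolution lowered to a matrix multiply via im2col. It is not the code of either framework: the function names (`im2col`, `gemm`, `conv2d_gemm`, `conv2d_direct`) are hypothetical, and it assumes a single channel, stride 1, and no padding.

```cpp
#include <cassert>
#include <vector>

// Unroll every KxK patch of a single-channel HxW image into one column.
// Result is row-major with K*K rows and OH*OW columns (stride 1, no padding).
std::vector<float> im2col(const std::vector<float>& img, int H, int W, int K) {
    assert(H >= K && W >= K);
    const int OH = H - K + 1, OW = W - K + 1;
    std::vector<float> cols(K * K * OH * OW);
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            for (int oy = 0; oy < OH; ++oy)
                for (int ox = 0; ox < OW; ++ox)
                    cols[((ky * K + kx) * OH + oy) * OW + ox] =
                        img[(oy + ky) * W + (ox + kx)];
    return cols;
}

// Plain row-major GEMM: C (M x N) = A (M x Kd) * B (Kd x N).
std::vector<float> gemm(const std::vector<float>& A, const std::vector<float>& B,
                        int M, int Kd, int N) {
    std::vector<float> C(M * N, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int l = 0; l < Kd; ++l)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * Kd + l] * B[l * N + j];
    return C;
}

// Convolution (cross-correlation, no kernel flip) as im2col followed by GEMM.
std::vector<float> conv2d_gemm(const std::vector<float>& img, int H, int W,
                               const std::vector<float>& kernel, int K) {
    const int OH = H - K + 1, OW = W - K + 1;
    return gemm(kernel, im2col(img, H, W, K), 1, K * K, OH * OW);
}

// Direct sliding-window reference for comparison.
std::vector<float> conv2d_direct(const std::vector<float>& img, int H, int W,
                                 const std::vector<float>& kernel, int K) {
    const int OH = H - K + 1, OW = W - K + 1;
    std::vector<float> out(OH * OW, 0.0f);
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    out[oy * OW + ox] +=
                        kernel[ky * K + kx] * img[(oy + ky) * W + (ox + kx)];
    return out;
}
```

Both paths compute the same result; which one is faster on a given device depends on how well the inner loops are vectorized and how the data is packed, so it has to be measured rather than assumed.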

cocodark commented 6 years ago

We do use NEON instructions for the matrix arithmetic, inside our GEMM:

    void Gemmer::dgemm_micro_kernel(int kc, float alpha, const float *A, const float *B, float beta, float *C, int incRowC, int incColC) {
    #ifndef MDL_MAC

    int i, j, l;
    float32x4_t abv0 = vdupq_n_f32(0);
    float32x4_t abv1 = vdupq_n_f32(0);
    float32x4_t abv2 = vdupq_n_f32(0);
    float32x4_t abv3 = vdupq_n_f32(0);

    float32x4_t av;
    float32x4_t bv;

    float32x2_t bv01;
    float32x2_t bv23;

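    // Each step is a rank-1 update: four floats loaded from A times four floats loaded from B.
    for (l = 0; l < kc; ++l) {
        av = vld1q_f32(A);
        bv = vld1q_f32(B);
        bv01 = vget_low_f32(bv);
        abv0 = vmlaq_lane_f32(abv0, av, bv01, 0);  // abv0 += av * bv[0]
        abv1 = vmlaq_lane_f32(abv1, av, bv01, 1);  // abv1 += av * bv[1]
        bv23 = vget_high_f32(bv);
        abv2 = vmlaq_lane_f32(abv2, av, bv23, 0);  // abv2 += av * bv[2]
        abv3 = vmlaq_lane_f32(abv3, av, bv23, 1);  // abv3 += av * bv[3]
        A += MR;
        B += NR;
    }

    // Write the four accumulators to AB_, four floats at a time.
    vst1q_f32(AB_ + 0, abv0);
    vst1q_f32(AB_ + 4, abv1);
    vst1q_f32(AB_ + 8, abv2);
    vst1q_f32(AB_ + 12, abv3);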
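
As a reading aid, the accumulation in the excerpt above can be written as a portable scalar loop. This is only a sketch: `micro_kernel_ref` is a hypothetical name, `MR` and `NR` are assumed to be 4 (the excerpt loads four floats from each panel per step), and the `alpha`/`beta` scaling and the write-back to `C`, which the excerpt does not show, are left out.

```cpp
#include <cassert>

// Assumed tile sizes, matching the four-float loads in the NEON excerpt.
constexpr int MR = 4;
constexpr int NR = 4;

// Scalar reference for the accumulation step.
// A holds MR floats per step l, B holds NR floats per step l.
// AB receives the MR x NR block with column j stored at AB[j * MR ... j * MR + MR - 1],
// mirroring the stores to AB_ + 0, + 4, + 8, + 12.
void micro_kernel_ref(int kc, const float* A, const float* B, float* AB) {
    assert(kc >= 0);
    for (int i = 0; i < MR * NR; ++i) AB[i] = 0.0f;
    for (int l = 0; l < kc; ++l) {
        for (int j = 0; j < NR; ++j)        // j selects the lane of B (abv0..abv3)
            for (int i = 0; i < MR; ++i)    // i selects the lane of A
                AB[j * MR + i] += A[i] * B[j];
        A += MR;
        B += NR;
    }
}
```

Comparing the output of a loop like this with the NEON path on the same packed inputs is one simple way to check the vectorized kernel.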

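On the Xiaomi Mi 6, our performance is as follows:

- googlenet: 360 ms on average
- squeezenet: 98 ms on average
- mobilenet: 360 ms on average

There are too many Android device models for us to cover them all, so please treat these numbers as a reference only. Thanks!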
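
The thread does not say how the averages were measured. For anyone who wants comparable numbers on their own device, the following is one possible sketch: `mean_latency_ms` is a hypothetical helper, and the choice of untimed warm-up runs followed by a mean over timed runs is an assumption, not the authors' documented procedure.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Call `fn` `warmup` times untimed, then return the mean wall-clock
// time in milliseconds over `runs` timed calls.
double mean_latency_ms(const std::function<void()>& fn, int warmup, int runs) {
    assert(warmup >= 0 && runs > 0);
    for (int i = 0; i < warmup; ++i) fn();

    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) fn();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / runs;
}
```

In use, `fn` would wrap a single forward pass of the network on a fixed input, for example `mean_latency_ms([&] { /* run one inference */ }, 5, 50)`.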

nkyle04 commented 6 years ago

Thanks for sharing. One more question: how large is the input image?

allonli commented 6 years ago

224*224

PaddlePaddle / Paddle-Lite

Could you provide performance data on mainstream Android CPUs? Thanks #5
