Windows Phone 8 NEON optimization

GoogleCodeExporter commented 9 years ago

Hi,

Thanks for your quick response to the request about WP8 support.
It works flawlessly - except for performance.

Do you plan to support NEON optimization for WP8?

Thanks

Original issue reported on code.google.com by pavel.pu...@gmail.com on 18 Mar 2014 at 2:10

GoogleCodeExporter commented 9 years ago

Neon is the key to libyuv performance on Arm.  It's roughly 10x faster,
depending on the function.

I'm not set up to test WP8, but if someone could contribute the code, I could 
help integrate it.

Would you, or a coworker, be able to do the conversion?  The task, I'd hope, 
would be a fairly straightforward conversion of row_neon.cc to row_win_neon.cc.

For row_win.cc vs row_posix.cc I start by compiling row_win.cc with Visual C 
and disassembling with gcc's objdump.

I did a quick test with VS2013 and it still builds for WP8.  I updated the
instructions to show that.

Original comment by fbarch...@chromium.org on 18 Mar 2014 at 9:14

GoogleCodeExporter commented 9 years ago

We could definitely try to contribute. What would be the simplest method/file 
that we could start with to test the process?

Original comment by pavel.pu...@gmail.com on 18 Mar 2014 at 9:45

GoogleCodeExporter commented 9 years ago

Ignoring headers and build files, I'd start with the simplest function you can
find in row_neon.cc.

Like this one:

void RGB24ToARGBRow_NEON(const uint8* src_rgb24, uint8* dst_argb, int pix) {
  asm volatile (
    "vmov.u8    d4, #255                       \n"  // Alpha
    ".p2align   2                              \n"
  "1:                                          \n"
    "vld3.8     {d1, d2, d3}, [%0]!            \n"  // load 8 pixels of RGB24.
    "subs       %2, %2, #8                     \n"  // 8 processed per loop.
    "vst4.8     {d1, d2, d3, d4}, [%1]!        \n"  // store 8 pixels of ARGB.
    "bgt        1b                             \n"
  : "+r"(src_rgb24),  // %0
    "+r"(dst_argb),   // %1
    "+r"(pix)         // %2
  :
  : "cc", "memory", "d1", "d2", "d3", "d4"  // Clobber List
  );
}
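
For anyone porting the loop above, it may help to have a plain scalar version to compare output against. The following is only a sketch: RGB24ToARGBRow_Reference is a hypothetical name, not a libyuv function, and it uses the standard uint8_t type instead of libyuv's uint8 typedef. It does what the NEON loop does per pixel: copy the three colour bytes in order and append an alpha byte of 255.

```cpp
#include <cstdint>

// Scalar sketch of the per-pixel work in the NEON loop: vld3.8 de-interleaves
// three bytes per pixel, vst4.8 re-interleaves them with a fourth byte (255).
// RGB24ToARGBRow_Reference is a hypothetical helper name used for illustration.
void RGB24ToARGBRow_Reference(const uint8_t* src_rgb24,
                              uint8_t* dst_argb,
                              int pix) {
  for (int x = 0; x < pix; ++x) {
    dst_argb[0] = src_rgb24[0];  // first colour byte, copied unchanged
    dst_argb[1] = src_rgb24[1];  // second colour byte
    dst_argb[2] = src_rgb24[2];  // third colour byte
    dst_argb[3] = 255u;          // opaque alpha
    src_rgb24 += 3;              // 3 bytes per RGB24 pixel
    dst_argb += 4;               // 4 bytes per ARGB pixel
  }
}
```

Note that the NEON loop handles 8 pixels per iteration, so it always processes a multiple of 8, while this sketch stops exactly at pix.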

You could also look at the same function in row_win.cc, written for Intel/Visual C,
to get a rough idea of the syntax.

__declspec(naked) __declspec(align(16))
void RGB24ToARGBRow_SSSE3(const uint8* src_rgb24, uint8* dst_argb, int pix) {
  __asm {
    mov       eax, [esp + 4]   // src_rgb24
    mov       edx, [esp + 8]   // dst_argb
    mov       ecx, [esp + 12]  // pix
    pcmpeqb   xmm5, xmm5       // generate mask 0xff000000
    pslld     xmm5, 24
    movdqa    xmm4, kShuffleMaskRGB24ToARGB

    align      4
 convertloop:
    movdqu    xmm0, [eax]
    movdqu    xmm1, [eax + 16]
    movdqu    xmm3, [eax + 32]
    lea       eax, [eax + 48]
    movdqa    xmm2, xmm3
    palignr   xmm2, xmm1, 8    // xmm2 = { xmm3[0:3] xmm1[8:15]}
    pshufb    xmm2, xmm4
    por       xmm2, xmm5
    palignr   xmm1, xmm0, 12   // xmm1 = { xmm1[0:7] xmm0[12:15]}
    pshufb    xmm0, xmm4
    movdqa    [edx + 32], xmm2
    por       xmm0, xmm5
    pshufb    xmm1, xmm4
    movdqa    [edx], xmm0
    por       xmm1, xmm5
    palignr   xmm3, xmm3, 4    // xmm3 = { xmm3[4:15]}
    pshufb    xmm3, xmm4
    movdqa    [edx + 16], xmm1
    por       xmm3, xmm5
    sub       ecx, 16
    movdqa    [edx + 48], xmm3
    lea       edx, [edx + 64]
    jg        convertloop
    ret
  }
}

See if that function ports to Visual C Neon.
If you get something building, attach the code here or email it, and I can
turn it into a code review.

Original comment by fbarch...@chromium.org on 19 Mar 2014 at 1:22

GoogleCodeExporter commented 9 years ago

Inline asm is not supported, so armasm is likely the easiest translation?
http://msdn.microsoft.com/en-us/library/hh873189.aspx

Original comment by fbarch...@chromium.org on 21 Mar 2014 at 1:45

GoogleCodeExporter commented 9 years ago

Using intrinsics may be another option?  It's not a trivial port, but it may also
work for ARMv8 64-bit.

Original comment by phthor...@gmail.com on 7 Jul 2014 at 10:46
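
To make the intrinsics suggestion concrete, here is a rough sketch of how the RGB24ToARGBRow_NEON loop quoted earlier might look with arm_neon.h intrinsics. Several things here are assumptions: the name RGB24ToARGBRow_Intrinsics is hypothetical, the preprocessor check is a guess at how to detect NEON across compilers, and whether the WP8 toolchain accepts it has not been tested. A scalar tail is included so the sketch also builds and runs on machines without NEON.

```cpp
#include <cstdint>

// Assumption: these macros indicate that arm_neon.h is available.
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#endif

// Hypothetical intrinsics port of the inline-asm loop; not libyuv code.
void RGB24ToARGBRow_Intrinsics(const uint8_t* src_rgb24,
                               uint8_t* dst_argb,
                               int pix) {
  int x = 0;
#if defined(SKETCH_HAS_NEON)
  const uint8x8_t alpha = vdup_n_u8(255);  // same role as vmov.u8 d4, #255
  for (; x + 8 <= pix; x += 8) {
    // De-interleaving load of 8 pixels, 3 bytes each (vld3.8 in the asm).
    const uint8x8x3_t in = vld3_u8(src_rgb24 + x * 3);
    uint8x8x4_t out;
    out.val[0] = in.val[0];
    out.val[1] = in.val[1];
    out.val[2] = in.val[2];
    out.val[3] = alpha;
    // Interleaving store of 8 pixels, 4 bytes each (vst4.8 in the asm).
    vst4_u8(dst_argb + x * 4, out);
  }
#endif
  // Scalar tail: handles leftover pixels, or every pixel when NEON is absent.
  for (; x < pix; ++x) {
    dst_argb[x * 4 + 0] = src_rgb24[x * 3 + 0];
    dst_argb[x * 4 + 1] = src_rgb24[x * 3 + 1];
    dst_argb[x * 4 + 2] = src_rgb24[x * 3 + 2];
    dst_argb[x * 4 + 3] = 255u;
  }
}
```

Unlike the asm version, which always consumes a multiple of 8 pixels, this sketch writes exactly pix pixels. The same source should in principle compile for 64-bit Arm, since the intrinsics used here are not specific to 32-bit NEON.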

GoogleCodeExporter commented 9 years ago

I also use libyuv on Windows Phone 8, so I converted some of the NEON functions
from inline asm to ARM NEON functions in .asm files.

With that change it runs as fast as on other platforms.

Original comment by seewo...@gmail.com on 12 Sep 2014 at 5:18

GoogleCodeExporter commented 9 years ago

seewoo79, would you be able to send the .asm files and any header/build changes?

Original comment by fbarch...@google.com on 26 Sep 2014 at 9:29

GoogleCodeExporter commented 9 years ago

Won't have time to work on this in the immediate future.
Try using gcc objects and/or clang-cl as a workaround?
Patches welcome!
File an issue if this is important.

Original comment by fbarch...@google.com on 11 Feb 2015 at 12:34

Changed state: WontFix
