Closed GoogleCodeExporter closed 8 years ago
Neon is the key to libyuv performance on Arm. It's roughly 10x faster,
depending on the function.
I'm not set up to test WP8, but if someone could contribute the code, I could
help integrate it.
Would you, or a coworker, be able to do the conversion? The task, I'd hope,
would be a fairly straightforward conversion of row_neon.cc to row_win_neon.cc.
For row_win.cc vs row_posix.cc I start by compiling row_win.cc with Visual C
and disassembling with gcc's objdump.
I did a quick test with vs2013 and it still builds for WP8. Updated the
instructions to show that.
Original comment by fbarch...@chromium.org
on 18 Mar 2014 at 9:14
We could definitely try to contribute. What would be the simplest method/file
that we could start with to test the process?
Original comment by pavel.pu...@gmail.com
on 18 Mar 2014 at 9:45
Ignoring headers and build files, I'd start with the simplest function you can
find in row_neon.cc
Like this one:
void RGB24ToARGBRow_NEON(const uint8* src_rgb24, uint8* dst_argb, int pix) {
  asm volatile (
    "vmov.u8 d4, #255 \n" // Alpha
    ".p2align 2 \n"
  "1: \n"
    "vld3.8 {d1, d2, d3}, [%0]! \n" // load 8 pixels of RGB24.
    "subs %2, %2, #8 \n" // 8 processed per loop.
    "vst4.8 {d1, d2, d3, d4}, [%1]! \n" // store 8 pixels of ARGB.
    "bgt 1b \n"
  : "+r"(src_rgb24), // %0
    "+r"(dst_argb), // %1
    "+r"(pix) // %2
  :
  : "cc", "memory", "d1", "d2", "d3", "d4" // Clobber List
  );
}
You could also look at row_win.cc for the same function for Intel/Visual C to
get a rough idea of the syntax.
__declspec(naked) __declspec(align(16))
void RGB24ToARGBRow_SSSE3(const uint8* src_rgb24, uint8* dst_argb, int pix) {
  __asm {
    mov eax, [esp + 4] // src_rgb24
    mov edx, [esp + 8] // dst_argb
    mov ecx, [esp + 12] // pix
    pcmpeqb xmm5, xmm5 // generate mask 0xff000000
    pslld xmm5, 24
    movdqa xmm4, kShuffleMaskRGB24ToARGB
    align 4
  convertloop:
    movdqu xmm0, [eax]
    movdqu xmm1, [eax + 16]
    movdqu xmm3, [eax + 32]
    lea eax, [eax + 48]
    movdqa xmm2, xmm3
    palignr xmm2, xmm1, 8 // xmm2 = { xmm3[0:3] xmm1[8:15]}
    pshufb xmm2, xmm4
    por xmm2, xmm5
    palignr xmm1, xmm0, 12 // xmm1 = { xmm1[0:7] xmm0[12:15]}
    pshufb xmm0, xmm4
    movdqa [edx + 32], xmm2
    por xmm0, xmm5
    pshufb xmm1, xmm4
    movdqa [edx], xmm0
    por xmm1, xmm5
    palignr xmm3, xmm3, 4 // xmm3 = { xmm3[4:15]}
    pshufb xmm3, xmm4
    movdqa [edx + 16], xmm1
    por xmm3, xmm5
    sub ecx, 16
    movdqa [edx + 48], xmm3
    lea edx, [edx + 64]
    jg convertloop
    ret
  }
}
See if that function ports to Visual C Neon.
Just attach the code here or email it, if you get something building, and I can
make a code review out of it.
Original comment by fbarch...@chromium.org
on 19 Mar 2014 at 1:22
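When checking a port like this, it can help to have a plain scalar version of the same row function to compare against. The following is a minimal sketch, not libyuv code: the names RGB24ToARGBRow_Ref, RowFunc and RowMatchesReference are made up for illustration, and it assumes, as the NEON loop above suggests, that the conversion copies the three colour bytes of each pixel in order and appends an alpha byte of 255.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Scalar reference for the RGB24 -> ARGB row conversion (hypothetical name).
// Assumption: each pixel's 3 source bytes are copied in order and a 4th
// byte of 255 (alpha) is appended, matching the vld3.8 / vst4.8 pair.
void RGB24ToARGBRow_Ref(const uint8_t* src_rgb24, uint8_t* dst_argb, int pix) {
  for (int x = 0; x < pix; ++x) {
    dst_argb[0] = src_rgb24[0];
    dst_argb[1] = src_rgb24[1];
    dst_argb[2] = src_rgb24[2];
    dst_argb[3] = 255u;  // opaque alpha
    src_rgb24 += 3;
    dst_argb += 4;
  }
}

// Signature shared by the row functions discussed in this thread.
typedef void (*RowFunc)(const uint8_t* src_rgb24, uint8_t* dst_argb, int pix);

// Hypothetical helper: runs a candidate row function on a generated row
// and reports whether its output equals the scalar reference byte for byte.
bool RowMatchesReference(RowFunc candidate, int pix) {
  std::vector<uint8_t> src(static_cast<size_t>(pix) * 3);
  for (size_t i = 0; i < src.size(); ++i) {
    src[i] = static_cast<uint8_t>(i * 7 + 1);  // arbitrary test pattern
  }
  std::vector<uint8_t> expected(static_cast<size_t>(pix) * 4, 0);
  std::vector<uint8_t> actual(expected.size(), 0);
  RGB24ToARGBRow_Ref(src.data(), expected.data(), pix);
  candidate(src.data(), actual.data(), pix);
  return expected == actual;
}
```

A ported row function could then be checked with a call such as RowMatchesReference(RGB24ToARGBRow_NEON, 64). The NEON loop above handles 8 pixels per iteration, so widths that are a multiple of 8 are the safe ones to test.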
Inline asm is not supported by Visual C on ARM, so armasm is likely the easiest translation?
http://msdn.microsoft.com/en-us/library/hh873189.aspx
Original comment by fbarch...@chromium.org
on 21 Mar 2014 at 1:45
Using intrinsics may be another option? Not a trivial port, but it may also
work for armv8 64 bit.
Original comment by phthor...@gmail.com
on 7 Jul 2014 at 10:46
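The intrinsics route mentioned above might look roughly like the following for the same RGB24ToARGB row. This is a sketch under assumptions and has not been tested with Visual C for ARM: the function name RGB24ToARGBRow_Intrin and the SKETCH_HAS_NEON macro are hypothetical, and the scalar loop is there only so the code also builds on non-ARM hosts and handles widths that are not a multiple of 8.

```cpp
#include <stdint.h>

// Assumption: these macros identify a NEON-capable ARM target
// (__ARM_NEON / __ARM_NEON__ for gcc and clang, _M_ARM for Visual C
// targeting 32-bit ARM).
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#endif

// Hypothetical intrinsics version of RGB24ToARGBRow_NEON.
void RGB24ToARGBRow_Intrin(const uint8_t* src_rgb24, uint8_t* dst_argb,
                           int pix) {
  int x = 0;
#ifdef SKETCH_HAS_NEON
  const uint8x8_t alpha = vdup_n_u8(255);  // same role as "vmov.u8 d4, #255"
  for (; x + 8 <= pix; x += 8) {
    // De-interleave 8 pixels of RGB24 into three 8-byte vectors (vld3.8).
    uint8x8x3_t in = vld3_u8(src_rgb24);
    uint8x8x4_t out;
    out.val[0] = in.val[0];
    out.val[1] = in.val[1];
    out.val[2] = in.val[2];
    out.val[3] = alpha;
    // Interleave the four vectors back into 8 pixels of ARGB (vst4.8).
    vst4_u8(dst_argb, out);
    src_rgb24 += 24;
    dst_argb += 32;
  }
#endif
  // Scalar path: leftover pixels on ARM, every pixel on other targets.
  for (; x < pix; ++x) {
    dst_argb[0] = src_rgb24[0];
    dst_argb[1] = src_rgb24[1];
    dst_argb[2] = src_rgb24[2];
    dst_argb[3] = 255u;
    src_rgb24 += 3;
    dst_argb += 4;
  }
}
```

Unlike the inline asm loop, this version leaves register allocation and the loop branch to the compiler, which is what would let the same source build for both 32-bit and 64-bit ARM.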
I also use libyuv on Windows Phone 8, so I converted some of the NEON functions
from inline asm to ARM NEON functions in .asm files.
It now works well, about as fast as on other platforms.
Original comment by seewo...@gmail.com
on 12 Sep 2014 at 5:18
seewoo79, would you be able to send the .asm files and any header/build changes?
Original comment by fbarch...@google.com
on 26 Sep 2014 at 9:29
Won't have time to work on this in the immediate future.
Try using gcc objects and/or clang-cl as a workaround?
Patches welcome!
File an issue if this is important.
Original comment by fbarch...@google.com
on 11 Feb 2015 at 12:34
Original issue reported on code.google.com by
pavel.pu...@gmail.com
on 18 Mar 2014 at 2:10